WMT24++: Expanding the Language Coverage of WMT24 to 55 Languages & Dialects
- URL: http://arxiv.org/abs/2502.12404v1
- Date: Tue, 18 Feb 2025 00:39:30 GMT
- Title: WMT24++: Expanding the Language Coverage of WMT24 to 55 Languages & Dialects
- Authors: Daniel Deutsch, Eleftheria Briakou, Isaac Caswell, Mara Finkelstein, Rebecca Galor, Juraj Juraska, Geza Kovacs, Alison Lui, Ricardo Rei, Jason Riesa, Shruti Rijhwani, Parker Riley, Elizabeth Salesky, Firas Trabelsi, Stephanie Winkler, Biao Zhang, Markus Freitag
- Abstract summary: We extend the WMT24 dataset to cover 55 languages by collecting new human-written references and post-edits for 46 new languages and dialects.
The dataset covers four domains: literary, news, social, and speech.
We benchmark a variety of MT providers and LLMs on the collected dataset using automatic metrics and find that LLMs are the best-performing MT systems in all 55 languages.
- Abstract: As large language models (LLMs) become increasingly capable in languages other than English, it is important to collect benchmark datasets in order to evaluate their multilingual performance, including on tasks like machine translation (MT). In this work, we extend the WMT24 dataset to cover 55 languages by collecting new human-written references and post-edits for 46 new languages and dialects, in addition to post-edits of the references in 8 out of 9 languages in the original WMT24 dataset. The dataset covers four domains: literary, news, social, and speech. We benchmark a variety of MT providers and LLMs on the collected dataset using automatic metrics and find that LLMs are the best-performing MT systems in all 55 languages. These results should be confirmed using a human-based evaluation, which we leave for future work.
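The benchmarking described in the abstract relies on automatic MT metrics computed against the newly collected human references. The snippet below is a minimal sketch of that kind of evaluation (not the paper's released tooling): it scores one system's outputs against reference translations with sacrebleu's BLEU and chrF. The file names, the en-de_DE language-pair label, and the one-segment-per-line plain-text format are illustrative assumptions, and the paper additionally reports learned neural metrics that are not shown here.

```python
# Minimal sketch: score MT outputs against human references with automatic
# metrics. Paths and the language-pair label are hypothetical examples.
import sacrebleu


def load_segments(path: str) -> list[str]:
    """Read one segment per line, keeping segment order."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


# Hypothetical parallel files for one language pair (e.g. en-de_DE):
# system outputs and human references, same number and order of segments.
hypotheses = load_segments("system_outputs.en-de_DE.txt")
references = load_segments("references.en-de_DE.txt")
assert len(hypotheses) == len(references), "hypothesis/reference count mismatch"

# Corpus-level BLEU and chrF; sacrebleu expects a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}  chrF: {chrf.score:.1f}")
```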
Related papers
- Multilingual Machine Translation with Open Large Language Models at Practical Scale: An Empirical Study
GemmaX2-28 is a 9B model achieving top-tier multilingual translation performance across 28 languages.
GemmaX2-28 consistently outperforms the state-of-the-art (SOTA) models such as TowerInstruct and XALMA.
arXiv Detail & Related papers (2025-02-04T16:57:03Z)
- AFRIDOC-MT: Document-level MT Corpus for African Languages
AFRIDOC-MT is a document-level multi-parallel translation dataset covering English and five African languages.
The dataset comprises 334 health and 271 information technology news documents, all human-translated from English to these languages.
arXiv Detail & Related papers (2025-01-10T22:49:29Z)
- Tagengo: A Multilingual Chat Dataset
We present a high-quality dataset of more than 70k prompt-response pairs in 74 languages.
We use this dataset to train a state-of-the-art open source English LLM to chat multilingually.
arXiv Detail & Related papers (2024-05-21T09:06:36Z)
- TaCo: Enhancing Cross-Lingual Transfer for Low-Resource Languages in LLMs through Translation-Assisted Chain-of-Thought Processes
We introduce the Multilingual Instruction-Tuning dataset (MITS), comprising translations of Alpaca-52K, Dolly-15K, and the Vicuna Benchmark into 132 languages.
We then propose a new method called TaCo: Translation-Assisted Cross-Linguality, which uses translations in a chain-of-thought process to instruction-tune LLMs on new languages via curriculum learning (see the illustrative sketch after this list).
Our results indicate that the TaCo method achieves an 82% GPT-4 evaluation score for a low-resource language on the Vicuna Benchmark, doubling the performance compared to instruction tuning alone.
arXiv Detail & Related papers (2023-11-17T06:55:32Z)
- ChatGPT MT: Competitive for High- (but not Low-) Resource Languages
Large language models (LLMs) implicitly learn to perform a range of language tasks, including machine translation (MT).
We present the first experimental evidence for an expansive set of 204 languages, along with MT cost analysis.
Our analysis reveals that a language's resource level is the most important feature in determining ChatGPT's relative ability to translate it.
arXiv Detail & Related papers (2023-09-14T04:36:00Z)
- MADLAD-400: A Multilingual And Document-Level Large Audited Dataset
We introduce MADLAD-400, a manually audited, general-domain monolingual dataset of 3T tokens based on CommonCrawl.
We discuss the limitations revealed by self-auditing MADLAD-400, and the role data auditing had in the dataset creation process.
arXiv Detail & Related papers (2023-09-09T02:34:01Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- Extrapolating Large Language Models to Non-English by Aligning Languages
Existing large language models show disparate capability across different languages.
In this paper, we empower pre-trained LLMs on non-English languages by building semantic alignment across languages.
arXiv Detail & Related papers (2023-08-09T13:32:06Z)
- SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation
We introduce SEAHORSE, a dataset for multilingual, multifaceted summarization evaluation.
SEAHORSE consists of 96K summaries with human ratings along 6 dimensions of text quality.
We show that metrics trained with SEAHORSE achieve strong performance on the out-of-domain meta-evaluation benchmarks TRUE and mFACE.
arXiv Detail & Related papers (2023-05-22T16:25:07Z)
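The TaCo entry above describes instruction-tuning with translations embedded in a chain-of-thought. As a purely illustrative sketch (the field names and template wording are assumptions, not the paper's released format), one such training example could pair an instruction in a low-resource target language with a response that translates the instruction into English, answers in English, and then translates the answer back:

```python
# Hypothetical construction of a translation-assisted chain-of-thought
# training example, in the spirit of the TaCo summary above. Field names
# and template wording are illustrative assumptions, not the paper's format.
def make_taco_style_example(instruction_tgt: str,
                            instruction_en: str,
                            response_en: str,
                            response_tgt: str,
                            target_lang: str) -> dict:
    """Pack a response that routes the reasoning through English."""
    chain_of_thought = (
        f"Instruction (translated to English): {instruction_en}\n"
        f"Response (in English): {response_en}\n"
        f"Response (translated to {target_lang}): {response_tgt}"
    )
    return {
        "instruction": instruction_tgt,  # prompt in the low-resource language
        "output": chain_of_thought,      # model learns to produce the full chain
    }


example = make_taco_style_example(
    instruction_tgt="<instruction written in the target language>",
    instruction_en="<its English translation>",
    response_en="<answer written in English>",
    response_tgt="<the answer translated back into the target language>",
    target_lang="Nepali",  # example low-resource language; any target works
)
print(example["output"])
```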