AFRIDOC-MT: Document-level MT Corpus for African Languages
- URL: http://arxiv.org/abs/2501.06374v1
- Date: Fri, 10 Jan 2025 22:49:29 GMT
- Title: AFRIDOC-MT: Document-level MT Corpus for African Languages
- Authors: Jesujoba O. Alabi, Israel Abebe Azime, Miaoran Zhang, Cristina España-Bonet, Rachel Bawden, Dawei Zhu, David Ifeoluwa Adelani, Clement Oyeleke Odoje, Idris Akinade, Iffat Maab, Davis David, Shamsuddeen Hassan Muhammad, Neo Putini, David O. Ademuyiwa, Andrew Caines, Dietrich Klakow
- Abstract summary: AFRIDOC-MT is a document-level multi-parallel translation dataset covering English and five African languages.
The dataset comprises 334 health and 271 information technology news documents, all human-translated from English to these languages.
- Score: 24.871863004002616
- Abstract: This paper introduces AFRIDOC-MT, a document-level multi-parallel translation dataset covering English and five African languages: Amharic, Hausa, Swahili, Yorùbá, and Zulu. The dataset comprises 334 health and 271 information technology news documents, all human-translated from English to these languages. We conduct document-level translation benchmark experiments by evaluating neural machine translation (NMT) models and large language models (LLMs) for translations between English and these languages, at both the sentence and pseudo-document levels. These outputs are realigned to form complete documents for evaluation. Our results indicate that NLLB-200 achieved the best average performance among the standard NMT models, while GPT-4o outperformed general-purpose LLMs. Fine-tuning selected models led to substantial performance gains, but models trained on sentences struggled to generalize effectively to longer documents. Furthermore, our analysis reveals that some LLMs exhibit issues such as under-generation, repetition of words or phrases, and off-target translations, especially for African languages.
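Below is a minimal sketch of the two translation granularities and the realignment step described in the abstract. It assumes a greedy whitespace-token chunking budget and a generic `translate` callable; these names and the chunking strategy are illustrative, not the paper's exact procedure.

```python
# Sentence-level vs. pseudo-document-level translation, then realignment
# into a full document-level hypothesis for evaluation.
# Assumptions (not from the paper): greedy token-budget chunking and a
# generic `translate` callable standing in for any NMT model or LLM.
from typing import Callable, List

MAX_TOKENS = 512  # assumed pseudo-document length budget


def make_pseudo_documents(sentences: List[str],
                          max_tokens: int = MAX_TOKENS) -> List[str]:
    """Greedily pack consecutive sentences into chunks under a token budget."""
    chunks: List[str] = []
    current: List[str] = []
    length = 0
    for sent in sentences:
        n = len(sent.split())  # crude whitespace token count
        if current and length + n > max_tokens:
            chunks.append(" ".join(current))
            current, length = [], 0
        current.append(sent)
        length += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def translate_document(sentences: List[str],
                       translate: Callable[[str], str],
                       level: str = "sentence") -> str:
    """Translate at the chosen granularity, then rejoin the outputs into a
    single document-level hypothesis for scoring against the reference."""
    units = sentences if level == "sentence" else make_pseudo_documents(sentences)
    return " ".join(translate(u) for u in units)
```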
Related papers
- Retrieval-Augmented Machine Translation with Unstructured Knowledge [74.84236945680503]
Retrieval-augmented generation (RAG) introduces additional information to enhance large language models (LLMs).
In machine translation (MT), previous work typically retrieves in-context examples from paired MT corpora, or domain-specific knowledge from knowledge graphs.
In this paper, we study retrieval-augmented MT using unstructured documents; a toy sketch of this setup follows this entry.
arXiv Detail & Related papers (2024-12-05T17:00:32Z)
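As a rough illustration of retrieval-augmented MT over unstructured documents, the toy code below retrieves lexically similar passages and prepends them to a translation prompt. The overlap-based retriever, prompt template, and `llm_translate` callable are assumptions for illustration, not the paper's method.

```python
# Hedged sketch of retrieval-augmented MT over unstructured documents:
# retrieve the most lexically similar passages, then ground the
# translation prompt in them. All names here are illustrative.
from typing import Callable, List


def retrieve(query: str, passages: List[str], k: int = 2) -> List[str]:
    """Rank passages by word overlap with the source sentence (toy retriever)."""
    q = set(query.lower().split())
    scored = sorted(passages,
                    key=lambda p: len(q & set(p.lower().split())),
                    reverse=True)
    return scored[:k]


def rag_translate(src: str, passages: List[str],
                  llm_translate: Callable[[str], str]) -> str:
    """Build a prompt that grounds the translation in retrieved context."""
    context = "\n".join(f"- {p}" for p in retrieve(src, passages))
    prompt = (f"Relevant background:\n{context}\n\n"
              f"Translate into Swahili, using the background for "
              f"terminology:\n{src}")
    return llm_translate(prompt)
```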
- Instruction-Tuned LLMs Succeed in Document-Level MT Without Fine-Tuning -- But BLEU Turns a Blind Eye [15.987448306012167]
Large language models (LLMs) have excelled in various NLP tasks, including machine translation (MT).
This work investigates the inherent capability of instruction-tuned LLMs for document-level translation (docMT).
arXiv Detail & Related papers (2024-10-28T11:49:58Z)
- Salute the Classic: Revisiting Challenges of Machine Translation in the Age of Large Language Models [91.6543868677356]
The evolution of Neural Machine Translation has been influenced by six core challenges.
These challenges include domain mismatch, amount of parallel data, rare word prediction, translation of long sentences, attention model as word alignment, and sub-optimal beam search.
This study revisits these challenges, offering insights into their ongoing relevance in the context of advanced Large Language Models.
arXiv Detail & Related papers (2024-01-16T13:30:09Z)
- Enhancing Document-level Translation of Large Language Model via Translation Mixed-instructions [24.025242477280983]
Existing large language models (LLMs) for machine translation are typically fine-tuned on sentence-level translation instructions, and they struggle when asked to translate whole documents.
This challenge arises from the issue of sentence-level coverage, where subsequent sentences in the document remain untranslated.
We propose an approach that combines sentence-level and document-level translation instructions of varying lengths to fine-tune LLMs; a sketch of the idea follows this entry.
arXiv Detail & Related papers (2024-01-16T03:28:26Z)
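A hedged sketch of the mixed-instructions idea from this entry: build fine-tuning examples whose inputs range from a single sentence up to the whole document, so the model sees translation instructions of varying lengths. The window sampling and prompt wording below are illustrative assumptions, not the paper's recipe.

```python
# Build instruction-tuning examples at mixed granularities
# (sentence .. full document) from a sentence-aligned document pair.
import random
from typing import Dict, List


def build_mixed_instructions(src_sents: List[str], tgt_sents: List[str],
                             seed: int = 0) -> List[Dict[str, str]]:
    """Emit instruction pairs spanning 1..N consecutive sentences."""
    assert len(src_sents) == len(tgt_sents) and src_sents
    rng = random.Random(seed)
    examples = []
    n = len(src_sents)
    for _ in range(n):
        span = rng.randint(1, n)          # granularity: sentence .. document
        start = rng.randint(0, n - span)  # random window within the document
        examples.append({
            "instruction": "Translate the following text into English.",
            "input": " ".join(src_sents[start:start + span]),
            "output": " ".join(tgt_sents[start:start + span]),
        })
    return examples
```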
- Adapting Large Language Models for Document-Level Machine Translation [46.370862171452444]
Large language models (LLMs) have significantly advanced various natural language processing (NLP) tasks.
Recent research indicates that moderately-sized LLMs often outperform larger ones after task-specific fine-tuning.
This study focuses on adapting LLMs for document-level machine translation (DocMT) for specific language pairs.
arXiv Detail & Related papers (2024-01-12T09:29:13Z)
- Unified Model Learning for Various Neural Machine Translation [63.320005222549646]
Existing neural machine translation (NMT) studies mainly focus on developing dataset-specific models.
We propose a versatile model, Unified Model Learning for NMT (UMLNMT), that works with data from different tasks.
UMLNMT yields substantial improvements over dataset-specific models with significantly reduced model deployment costs.
arXiv Detail & Related papers (2023-05-04T12:21:52Z)
- Document-Level Machine Translation with Large Language Models [91.03359121149595]
Large language models (LLMs) can produce coherent, cohesive, relevant, and fluent answers for various natural language processing (NLP) tasks.
This paper provides an in-depth evaluation of LLMs' ability in discourse modeling.
arXiv Detail & Related papers (2023-04-05T03:49:06Z)
- DOCmT5: Document-Level Pretraining of Multilingual Language Models [9.072507490639218]
We introduce DOCmT5, a multilingual sequence-to-sequence language model pre-trained with large-scale parallel documents.
We propose a simple and effective pre-training objective, Document Reordering Machine Translation (DrMT); a sketch of the idea follows this entry.
DrMT brings consistent improvements over strong baselines on a variety of document-level generation tasks.
arXiv Detail & Related papers (2021-12-16T08:58:52Z)
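A hedged sketch of what a DrMT-style pre-training instance could look like, based on this entry's description: the input is the source document with its sentences shuffled, and the target is the translation in the original order, so the model must jointly reorder and translate. The exact corruption scheme is an assumption here.

```python
# Construct one DrMT-style example: shuffled source document in,
# correctly ordered translation out.
import random
from typing import List, Tuple


def make_drmt_example(src_sents: List[str], tgt_sents: List[str],
                      seed: int = 0) -> Tuple[str, str]:
    """Shuffle the source sentences; keep the target in document order."""
    rng = random.Random(seed)
    order = list(range(len(src_sents)))
    rng.shuffle(order)
    shuffled_src = " ".join(src_sents[i] for i in order)
    target = " ".join(tgt_sents)  # correctly ordered translation
    return shuffled_src, target
```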
- Towards Making the Most of Context in Neural Machine Translation [112.9845226123306]
We argue that previous research did not make clear use of the global context.
We propose a new document-level NMT framework that deliberately models the local context of each sentence.
arXiv Detail & Related papers (2020-02-19T03:30:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.