Related papers: Correcting FLORES Evaluation Dataset for Four African Languages

URL: http://arxiv.org/abs/2409.00626v2
Date: Sat, 5 Oct 2024 19:02:31 GMT
Title: Correcting FLORES Evaluation Dataset for Four African Languages
Authors: Idris Abdulmumin, Sthembiso Mkhwanazi, Mahlatse S. Mbooi, Shamsuddeen Hassan Muhammad, Ibrahim Said Ahmad, Neo Putini, Miehleketo Mathebula, Matimba Shingange, Tajuddeen Gwadabe, Vukosi Marivate
Abstract summary: The original dataset, though groundbreaking in its coverage of low-resource languages, exhibited various inconsistencies and inaccuracies. Through a meticulous review process by native speakers, several corrections were identified and implemented. We believe that our corrections improve the linguistic accuracy and reliability of the data.
Score: 2.552967468434151
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper describes the corrections made to the FLORES evaluation (dev and devtest) dataset for four African languages, namely Hausa, Northern Sotho (Sepedi), Xitsonga, and isiZulu. The original dataset, though groundbreaking in its coverage of low-resource languages, exhibited various inconsistencies and inaccuracies in the reviewed languages that could potentially hinder the integrity of the evaluation of downstream tasks in natural language processing (NLP), especially machine translation. Through a meticulous review process by native speakers, several corrections were identified and implemented, improving the overall quality and reliability of the dataset. For each language, we provide a concise summary of the errors encountered and corrected and also present some statistical analysis that measures the difference between the existing and corrected datasets. We believe that our corrections improve the linguistic accuracy and reliability of the data and, thereby, contribute to a more effective evaluation of NLP tasks involving the four African languages. Finally, we recommend that future translation efforts, particularly in low-resource languages, prioritize the active involvement of native speakers at every stage of the process to ensure linguistic accuracy and cultural relevance.

Related papers

Investigating the Multilingual Calibration Effects of Language Model Instruction-Tuning [58.355275813623685]
This work looks at a critical gap in the calibration of large language models (LLMs) within multilingual settings. Even in low-resource languages, model confidence can increase significantly after instruction-tuning on high-resource language SFT datasets. However, improvements in accuracy are marginal or non-existent, highlighting a critical shortcoming of standard SFT for multilingual languages.
arXiv Detail & Related papers (2026-01-04T04:29:12Z)
Low-Resource English-Tigrinya MT: Leveraging Multilingual Models, Custom Tokenizers, and Clean Evaluation Benchmarks [6.177998679139308]
Despite advances in Neural Machine Translation (NMT), low-resource languages like Tigrinya remain underserved. This paper investigates transfer learning techniques using multilingual pretrained models to enhance translation quality for morphologically rich, low-resource languages.
arXiv Detail & Related papers (2025-09-24T15:02:57Z)
Testing the Limits of Machine Translation from One Book [0.0]
Current state-of-the-art models demonstrate the capacity to leverage in-context learning to translate into previously unseen language contexts. We focus on Kanuri, a language that, despite having a substantial speaker population, has minimal digital resources.
arXiv Detail & Related papers (2025-08-08T19:27:44Z)
Natural language processing for African languages [7.884789325654572]
This dissertation focuses on languages spoken in Sub-Saharan Africa, where all the indigenous languages can be regarded as low-resourced. We show that the quality of semantic representations learned in word embeddings depends not only on the amount of data but also on the quality of pre-training data. We develop large-scale human-annotated labelled datasets for 21 African languages in two impactful NLP tasks.
arXiv Detail & Related papers (2025-06-30T22:26:36Z)
Information Loss in LLMs' Multilingual Translation: The Role of Training Data, Language Proximity, and Language Family [0.9422186097220215]
This study systematically investigates how training data, language proximity, and language family affect information loss in multilingual translation. We evaluate two large language models, GPT-4 and Llama 2, by performing round-trip translations.
arXiv Detail & Related papers (2025-06-29T17:21:05Z)
Transcending Language Boundaries: Harnessing LLMs for Low-Resource Language Translation [38.81102126876936]
This paper introduces a novel retrieval-based method that enhances translation quality for low-resource languages by focusing on key terms. To evaluate the effectiveness of this method, we conducted experiments translating from English into three low-resource languages: Cherokee, a critically endangered indigenous language of North America; Tibetan, a historically and culturally significant language in Asia; and Manchu, a language with few remaining speakers. Our comparison with the zero-shot performance of GPT-4o and LLaMA 3.1 405B highlights the significant challenges these models face when translating into low-resource languages.
arXiv Detail & Related papers (2024-11-18T05:41:27Z)
A Comparative Study of Translation Bias and Accuracy in Multilingual Large Language Models for Cross-Language Claim Verification [1.566834021297545]
This study systematically evaluates translation bias and the effectiveness of Large Language Models for cross-lingual claim verification. We investigate two distinct translation methods: pre-translation and self-translation. Our findings reveal that low-resource languages exhibit significantly lower accuracy in direct inference due to underrepresentation.
arXiv Detail & Related papers (2024-10-14T09:02:42Z)
A Data Selection Approach for Enhancing Low Resource Machine Translation Using Cross-Lingual Sentence Representations [0.4499833362998489]
This study focuses on the case of English-Marathi language pairs, where existing datasets are notably noisy. To mitigate the impact of data quality issues, we propose a data filtering approach based on cross-lingual sentence representations. Results demonstrate a significant improvement in translation quality over the baseline post-filtering with IndicSBERT.
arXiv Detail & Related papers (2024-09-04T13:49:45Z)
Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z)
NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
Large language models effectively leverage document-level context for literary translation, but critical errors persist [32.54546652197316]
Large language models (LLMs) are competitive with the state of the art on a wide range of sentence-level translation datasets. We show through a rigorous human evaluation that asking the GPT-3.5 (text-davinci-003) LLM to translate an entire literary paragraph results in higher-quality translations.
arXiv Detail & Related papers (2023-04-06T17:27:45Z)
MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition [55.95128479289923]
African languages are spoken by over a billion people, but are underrepresented in NLP research and development. We create the largest human-annotated NER dataset for 20 African languages. We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points.
arXiv Detail & Related papers (2022-10-22T08:53:14Z)
No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages. We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks. Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z)
On the Language Coverage Bias for Neural Machine Translation [81.81456880770762]
Language coverage bias is important for neural machine translation (NMT) because the target-original training data is not well exploited in current practice. By carefully designing experiments, we provide comprehensive analyses of the language coverage bias in the training data. We propose two simple and effective approaches to alleviate the language coverage bias problem.
arXiv Detail & Related papers (2021-06-07T01:55:34Z)
TextFlint: Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing [73.16475763422446]
We propose a multilingual robustness evaluation platform for NLP tasks (TextFlint). It incorporates universal text transformation, task-specific transformation, adversarial attack, subpopulation, and their combinations to provide comprehensive robustness analysis. TextFlint generates complete analytical reports as well as targeted augmented data to address the shortcomings of the model's robustness.
arXiv Detail & Related papers (2021-03-21T17:20:38Z)

This list is automatically generated from the titles and abstracts of the papers on this site.