When Flores Bloomz Wrong: Cross-Direction Contamination in Machine Translation Evaluation
- URL: http://arxiv.org/abs/2601.20858v1
- Date: Wed, 28 Jan 2026 18:56:21 GMT
- Title: When Flores Bloomz Wrong: Cross-Direction Contamination in Machine Translation Evaluation
- Authors: David Tan, Pinzhen Chen, Josef van Genabith, Koel Dutta Chowdhury
- Abstract summary: Large language models (LLMs) can be benchmark-contaminated, resulting in inflated scores that mask memorization as generalization. We show that machine translation contamination can be cross-directional, artificially boosting performance in unseen translation directions due to target-side memorization.
- Score: 12.89127380889145
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) can be benchmark-contaminated, resulting in inflated scores that mask memorization as generalization, and in multilingual settings, this memorization can even transfer to "uncontaminated" languages. Using the FLORES-200 translation benchmark as a diagnostic, we study two 7-8B instruction-tuned multilingual LLMs: Bloomz, which was trained on FLORES, and Llama as an uncontaminated control. We confirm Bloomz's FLORES contamination and demonstrate that machine translation contamination can be cross-directional, artificially boosting performance in unseen translation directions due to target-side memorization. Further analysis shows that recall of memorized references often persists despite various source-side perturbation efforts like paraphrasing and named entity replacement. However, replacing named entities leads to a consistent decrease in BLEU, suggesting an effective probing method for memorization in contaminated models.
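The entity-replacement probe described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `translate` callable is a placeholder for any MT model, and clipped unigram precision stands in for full BLEU on FLORES-200.

```python
from collections import Counter

def unigram_precision(hypothesis, reference):
    """Clipped unigram precision: a toy stand-in for sentence-level BLEU."""
    hyp, ref = Counter(hypothesis.split()), Counter(reference.split())
    overlap = sum(min(count, ref[tok]) for tok, count in hyp.items())
    return overlap / max(sum(hyp.values()), 1)

def entity_replacement_probe(translate, source, reference, entity_map):
    """Swap named entities in the source and measure the score change
    against the ORIGINAL reference. Per the abstract, replacing named
    entities leads to a consistent BLEU decrease in contaminated models,
    so this drop can serve as a probing signal for memorization.
    """
    baseline = unigram_precision(translate(source), reference)
    perturbed_source = source
    for old, new in entity_map.items():
        perturbed_source = perturbed_source.replace(old, new)
    perturbed = unigram_precision(translate(perturbed_source), reference)
    return baseline, baseline - perturbed
```

In practice one would run the probe over a full test set with a real BLEU implementation (e.g. sacreBLEU) and inspect the consistency of the drop across sentences rather than any single score.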
Related papers
- Beyond a Single Reference: Training and Evaluation with Paraphrases in Sign Language Translation [1.9102169745315323]
Most Sign Language Translation (SLT) corpora pair each signed utterance with a single written-language reference. This limitation constrains both model training and evaluation. We introduce BLEUpara, an extension of BLEU that evaluates translations against multiple paraphrased references.
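Scoring against multiple paraphrased references is commonly done by keeping the best match per hypothesis; BLEUpara's exact aggregation is not given in this summary, so the following is a generic multi-reference sketch with a toy overlap metric standing in for BLEU:

```python
from collections import Counter

def unigram_overlap(hypothesis, reference):
    """Toy sentence-level score: clipped fraction of hypothesis tokens
    that also appear in the reference (stand-in for real BLEU)."""
    hyp, ref = Counter(hypothesis.split()), Counter(reference.split())
    overlap = sum(min(count, ref[tok]) for tok, count in hyp.items())
    return overlap / max(sum(hyp.values()), 1)

def best_reference_score(hypothesis, references, score_fn=unigram_overlap):
    """Score a translation against several paraphrased references and
    keep the best match, so a valid paraphrase is not penalized for
    differing from one arbitrary reference."""
    return max(score_fn(hypothesis, ref) for ref in references)
```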
arXiv Detail & Related papers (2026-01-29T00:02:19Z)
- Obscuring Data Contamination Through Translation: Evidence from Arabic Corpora [0.3288086999241324]
We investigate contamination dynamics in multilingual settings by fine-tuning several open-weight Large Language Models. We show that translation into Arabic suppresses conventional contamination indicators, yet models still benefit from exposure to contaminated data. We propose Translation-Aware Contamination Detection, which identifies contamination by comparing signals across multiple translated benchmark variants.
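The summary only says that detection compares signals across translated benchmark variants; one plausible realization is to flag a benchmark when the model scores markedly better on one language variant than on its translations. The function below is a hypothetical sketch along those lines, and its names and threshold are assumptions, not the paper's actual method:

```python
def contamination_flag(score_fn, variants, margin=0.05):
    """Hypothetical comparison across translated benchmark variants.

    variants: mapping from language tag to that variant's evaluation data;
    score_fn: scores the model on one variant. A large gap between the
    best-scoring variant and the mean of the others is treated as a
    contamination signal (the margin is an illustrative choice).
    """
    scores = {lang: score_fn(data) for lang, data in variants.items()}
    best = max(scores, key=scores.get)
    others = [s for lang, s in scores.items() if lang != best]
    gap = scores[best] - sum(others) / len(others)
    return gap > margin, scores
```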
arXiv Detail & Related papers (2026-01-21T13:53:04Z)
- Lost in Literalism: How Supervised Training Shapes Translationese in LLMs [51.04435855143767]
Large language models (LLMs) have achieved remarkable success in machine translation. However, translationese, characterized by overly literal and unnatural translations, remains a persistent challenge. We introduce methods to mitigate these biases, including polishing golden references and filtering unnatural training instances.
arXiv Detail & Related papers (2025-03-06T12:14:45Z)
- Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora. But can these models relate corresponding concepts across languages, i.e., be crosslingual? This study evaluates state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z)
- Data Contamination Can Cross Language Barriers [29.103517721155487]
The opacity in developing large language models (LLMs) is raising growing concerns about the potential contamination of public benchmarks in the pre-training data.
We first present a cross-lingual form of contamination that inflates LLMs' performance while evading current detection methods.
We propose generalization-based approaches to unmask such deeply concealed contamination.
arXiv Detail & Related papers (2024-06-19T05:53:27Z)
- Comparing Hallucination Detection Metrics for Multilingual Generation [62.97224994631494]
This paper assesses how well various factual hallucination detection metrics identify hallucinations in generated biographical summaries across languages.
We compare how well automatic metrics correlate to each other and whether they agree with human judgments of factuality.
Our analysis reveals that while the lexical metrics are ineffective, NLI-based metrics perform well, correlating with human annotations in many settings and often outperforming supervised models.
arXiv Detail & Related papers (2024-02-16T08:10:34Z)
- Unlikelihood Tuning on Negative Samples Amazingly Improves Zero-Shot Translation [79.96416609433724]
Zero-shot translation (ZST) aims to translate between unseen language pairs in training data.
The common practice to guide the zero-shot language mapping during inference is to deliberately insert the source and target language IDs.
Recent studies have shown that language IDs sometimes fail to navigate the ZST task, making them suffer from the off-target problem.
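Steering the translation direction with language IDs usually amounts to placing special tokens around the source sequence. The sketch below is illustrative only: the exact tag format and placement vary across model families (e.g. mBART supplies the target language as the decoder's start token), so the token strings here are assumptions.

```python
def build_zst_input(src_tokens, src_lang, tgt_lang):
    """Wrap a tokenized source with language-ID tokens to signal the
    zero-shot translation direction (illustrative tag format; real
    multilingual models differ in where the target ID is injected)."""
    return [f"__{src_lang}__"] + list(src_tokens) + [f"__{tgt_lang}__"]
```

The off-target problem mentioned above is precisely when such IDs fail: the model emits output in a language other than `tgt_lang` despite the tag.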
arXiv Detail & Related papers (2023-09-28T17:02:36Z)
- Detecting and Mitigating Hallucinations in Multilingual Summarisation [40.5267502712576]
Hallucinations pose a significant challenge to the reliability of neural models for abstractive summarisation.
We develop a novel metric, mFACT, evaluating the faithfulness of non-English summaries.
We then propose a simple but effective method to reduce hallucinations with a cross-lingual transfer.
arXiv Detail & Related papers (2023-05-23T02:59:25Z)
- Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of Code-Mixed Clinical Texts [56.72488923420374]
Pre-trained language models (LMs) have shown great potential for cross-lingual transfer in low-resource settings.
We show the few-shot cross-lingual transfer property of LMs for named entity recognition (NER) and apply it to a low-resource, real-world challenge: de-identification of code-mixed (Spanish-Catalan) clinical notes in the stroke domain.
arXiv Detail & Related papers (2022-04-10T21:46:52Z)
- Crosslingual Embeddings are Essential in UNMT for Distant Languages: An English to IndoAryan Case Study [28.409618457653135]
We show that initializing the embedding layer of UNMT models with cross-lingual embeddings shows significant improvements in BLEU score over existing approaches.
We experimented using Masked Sequence to Sequence (MASS) and Denoising Autoencoder (DAE) UNMT approaches for three distant language pairs.
arXiv Detail & Related papers (2021-06-09T11:31:27Z)
- On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation [55.02832094101173]
Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual similarity.
This paper concerns itself with reference-free machine translation (MT) evaluation, where we directly compare source texts to (sometimes low-quality) system translations.
We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER.
We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations.
arXiv Detail & Related papers (2020-05-03T22:10:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.