Obscuring Data Contamination Through Translation: Evidence from Arabic Corpora
- URL: http://arxiv.org/abs/2601.14994v1
- Date: Wed, 21 Jan 2026 13:53:04 GMT
- Title: Obscuring Data Contamination Through Translation: Evidence from Arabic Corpora
- Authors: Chaymaa Abbas, Nour Shamaa, Mariette Awad
- Abstract summary: We investigate contamination dynamics in multilingual settings by fine-tuning several open-weight Large Language Models. We show that translation into Arabic suppresses conventional contamination indicators, yet models still benefit from exposure to contaminated data. We propose Translation-Aware Contamination Detection, which identifies contamination by comparing signals across multiple translated benchmark variants.
- Score: 0.3288086999241324
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data contamination undermines the validity of Large Language Model evaluation by enabling models to rely on memorized benchmark content rather than true generalization. While prior work has proposed contamination detection methods, these approaches are largely limited to English benchmarks, leaving multilingual contamination poorly understood. In this work, we investigate contamination dynamics in multilingual settings by fine-tuning several open-weight LLMs on varying proportions of Arabic datasets and evaluating them on original English benchmarks. To detect memorization, we extend the Tested Slot Guessing method with a choice-reordering strategy and incorporate Min-K% probability analysis, capturing both behavioral and distributional contamination signals. Our results show that translation into Arabic suppresses conventional contamination indicators, yet models still benefit from exposure to contaminated data, particularly those with stronger Arabic capabilities. This effect is consistently reflected in rising Min-K% scores and increased cross-lingual answer consistency as contamination levels grow. To address this blind spot, we propose Translation-Aware Contamination Detection, which identifies contamination by comparing signals across multiple translated benchmark variants rather than English alone. Translation-Aware Contamination Detection reliably exposes contamination even when English-only methods fail. Together, our findings highlight the need for multilingual, translation-aware evaluation pipelines to ensure fair, transparent, and reproducible assessment of LLMs.
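The Min-K% probability signal mentioned in the abstract can be illustrated with a minimal sketch. This is a generic rendering of the Min-K% Prob technique, not the authors' exact implementation; `token_logprobs` is assumed to hold per-token log-probabilities that the model under test assigns to a benchmark passage.

```python
def min_k_percent(token_logprobs, k=20.0):
    """Average log-probability of the k% least likely tokens in a passage.

    Intuition: if a passage was memorized during training, even its least
    likely tokens receive relatively high probability, so this average
    rises (becomes less negative) with contamination.
    """
    n = max(1, int(len(token_logprobs) * k / 100))
    lowest = sorted(token_logprobs)[:n]  # the k% lowest log-probs
    return sum(lowest) / n
```

In the translation-aware setting the paper describes, such a score would be compared across the original English benchmark and its translated (e.g., Arabic) variants, so that contamination suppressed in the English-only signal can still surface in a translated variant.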
Related papers
- When Flores Bloomz Wrong: Cross-Direction Contamination in Machine Translation Evaluation [12.89127380889145]
Large language models (LLMs) can be benchmark-contaminated, resulting in inflated scores that mask memorization as generalization. We show that machine translation contamination can be cross-directional, artificially boosting performance in unseen translation directions due to target-side perturbation.
arXiv Detail & Related papers (2026-01-28T18:56:21Z)
- Investigating the Multilingual Calibration Effects of Language Model Instruction-Tuning [58.355275813623685]
This work examines a critical gap in the calibration of large language models (LLMs) in multilingual settings. Even in low-resource languages, model confidence can increase significantly after instruction-tuning on high-resource-language SFT datasets. However, improvements in accuracy are marginal or non-existent, highlighting a critical shortcoming of standard SFT in multilingual settings.
arXiv Detail & Related papers (2026-01-04T04:29:12Z)
- Contamination Detection for VLMs using Multi-Modal Semantic Perturbation [73.76465227729818]
Open-source Vision-Language Models (VLMs) have achieved state-of-the-art performance on benchmark tasks. Their pretraining corpora raise a critical concern for both practitioners and users: inflated performance due to test-set leakage. We show that existing detection approaches either fail outright or exhibit inconsistent behavior. We propose a simple yet effective detection method based on multi-modal semantic perturbation.
arXiv Detail & Related papers (2025-11-05T18:59:52Z)
- Translate, then Detect: Leveraging Machine Translation for Cross-Lingual Toxicity Classification [35.35733615199578]
We compare translation-based and language-specific/multilingual classification pipelines. We find that translation benefits correlate strongly with both the resource level of the target language and the quality of the machine translation system.
arXiv Detail & Related papers (2025-09-17T23:58:07Z)
- Both Text and Images Leaked! A Systematic Analysis of Data Contamination in Multimodal LLM [53.05486269607166]
Multimodal large language models (MLLMs) have significantly enhanced performance across benchmarks. Existing detection methods for unimodal large language models (LLMs) are inadequate for MLLMs due to multimodal data complexity and multi-phase training. We analyze multimodal data contamination using our analytical framework, MM-Detect, which defines two contamination categories: unimodal and cross-modal.
arXiv Detail & Related papers (2024-11-06T10:44:15Z)
- Pretraining Data Detection for Large Language Models: A Divergence-based Calibration Method [108.56493934296687]
We introduce a divergence-based calibration method, inspired by the divergence-from-randomness concept, to calibrate token probabilities for pretraining data detection. We have developed a Chinese-language benchmark, PatentMIA, to assess the performance of detection approaches for LLMs on Chinese text.
arXiv Detail & Related papers (2024-09-23T07:55:35Z)
- Data Contamination Can Cross Language Barriers [29.103517721155487]
The opacity in developing large language models (LLMs) is raising growing concerns about the potential contamination of public benchmarks in the pre-training data.
We first present a cross-lingual form of contamination that inflates LLMs' performance while evading current detection methods.
We propose generalization-based approaches to unmask such deeply concealed contamination.
arXiv Detail & Related papers (2024-06-19T05:53:27Z)
- A Comprehensive Survey of Contamination Detection Methods in Large Language Models [68.10605098856087]
With the rise of Large Language Models (LLMs) in recent years, abundant new opportunities are emerging, but also new challenges. LLMs' performance may no longer be reliable, as high scores may be at least partly due to prior exposure to the evaluation data. This limitation jeopardizes real capability improvement in the field of NLP, yet there remains a lack of methods to efficiently detect contamination.
arXiv Detail & Related papers (2024-03-31T14:32:02Z)
- Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models [42.958880063727996]
CDD stands for Contamination Detection via output Distribution for LLMs.
To mitigate the impact of data contamination in evaluation, we also present TED: Trustworthy Evaluation via output Distribution.
arXiv Detail & Related papers (2024-02-24T23:54:41Z)
- Rethinking Benchmark and Contamination for Language Models with Rephrased Samples [49.18977581962162]
Large language models are increasingly trained on all the data ever produced by humans.
Many have raised concerns about the trustworthiness of public benchmarks due to potential contamination in pre-training or fine-tuning datasets.
arXiv Detail & Related papers (2023-11-08T17:35:20Z)
- Vicinal Risk Minimization for Few-Shot Cross-lingual Transfer in Abusive Language Detection [19.399281609371258]
Cross-lingual transfer learning from high-resource to medium and low-resource languages has shown encouraging results.
We resort to data augmentation and continual pre-training for domain adaptation to improve cross-lingual abusive language detection.
arXiv Detail & Related papers (2023-11-03T16:51:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.