Data Contamination Can Cross Language Barriers
- URL: http://arxiv.org/abs/2406.13236v1
- Date: Wed, 19 Jun 2024 05:53:27 GMT
- Title: Data Contamination Can Cross Language Barriers
- Authors: Feng Yao, Yufan Zhuang, Zihao Sun, Sunan Xu, Animesh Kumar, Jingbo Shang
- Abstract summary: The opacity in developing large language models (LLMs) is raising growing concerns about the potential contamination of public benchmarks in the pre-training data.
We first present a cross-lingual form of contamination that inflates LLMs' performance while evading current detection methods.
We propose generalization-based approaches to unmask such deeply concealed contamination.
- Score: 29.103517721155487
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The opacity in developing large language models (LLMs) is raising growing concerns about the potential contamination of public benchmarks in the pre-training data. Existing contamination detection methods are typically based on the text overlap between training and evaluation data, which can be too superficial to reflect deeper forms of contamination. In this paper, we first present a cross-lingual form of contamination that inflates LLMs' performance while evading current detection methods, deliberately injected by overfitting LLMs on the translated versions of benchmark test sets. Then, we propose generalization-based approaches to unmask such deeply concealed contamination. Specifically, we examine the LLM's performance change after modifying the original benchmark by replacing the false answer choices with correct ones from other questions. Contaminated models can hardly generalize to such easier situations, where the false choices can be \emph{not even wrong}, as all choices are correct in their memorization. Experimental results demonstrate that cross-lingual contamination can easily fool existing detection methods, but not ours. In addition, we discuss the potential utilization of cross-lingual contamination in interpreting LLMs' working mechanisms and in post-training LLMs for enhanced multilingual capabilities. The code and dataset we use can be obtained from \url{https://github.com/ShangDataLab/Deep-Contam}.
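The detection recipe described above (replace every false choice with a correct answer borrowed from another question, then compare how a suspect model and a clean reference cope with the easier variant) can be illustrated with a minimal sketch. The data layout and function names below are illustrative assumptions, not the authors' released code, which is available from the linked repository.

```python
# Minimal sketch of the generalization probe: all distractors become correct
# answers taken from other questions ("not even wrong"), and a large accuracy
# drop on this easier variant is treated as a contamination signal.
import random

def build_all_correct_variant(questions, seed=0):
    """questions: list of dicts with keys 'question', 'choices', 'answer_idx'."""
    rng = random.Random(seed)
    variants = []
    for i, q in enumerate(questions):
        # Correct answers drawn from *other* questions replace the distractors.
        donors = [p["choices"][p["answer_idx"]]
                  for j, p in enumerate(questions) if j != i]
        new_choices = [choice if k == q["answer_idx"] else rng.choice(donors)
                       for k, choice in enumerate(q["choices"])]
        variants.append({**q, "choices": new_choices})
    return variants

def contamination_signal(acc_on_original, acc_on_variant):
    # A clean model should find the variant no harder; a model that memorized
    # the benchmark typically fails to generalize to it.
    return acc_on_original - acc_on_variant
```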
Related papers
- Robustness of LLMs to Perturbations in Text [2.0670689746336]
Large language models (LLMs) have shown impressive performance, but can they handle the inevitable noise in real-world data?
This work tackles this critical question by investigating LLMs' resilience against morphological variations in text.
Our findings show that, contrary to popular belief, generative LLMs are quite robust to noisy perturbations in text.
arXiv Detail & Related papers (2024-07-12T04:50:17Z)
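One simple way to probe this kind of resilience is to inject character-level noise into prompts and compare task accuracy before and after. The injector below is a generic sketch for illustration, not the perturbation suite used in the paper.

```python
# Generic character-level noise injector for robustness probes (illustrative only).
import random

def perturb(text, rate=0.05, seed=0):
    """Randomly swap, drop, or duplicate alphabetic characters at the given rate."""
    rng = random.Random(seed)
    chars, out, i = list(text), [], 0
    while i < len(chars):
        if chars[i].isalpha() and rng.random() < rate:
            op = rng.choice(["swap", "drop", "dup"])
            if op == "swap" and i + 1 < len(chars):
                out.extend([chars[i + 1], chars[i]]); i += 2; continue
            if op == "drop":
                i += 1; continue
            out.extend([chars[i], chars[i]]); i += 1; continue
        out.append(chars[i]); i += 1
    return "".join(out)

# Example: evaluate the same prompts clean and perturbed, then compare accuracy.
print(perturb("Translate the following sentence into French.", rate=0.1))
```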
- Understanding and Mitigating Language Confusion in LLMs [76.96033035093204]
We evaluate 15 typologically diverse languages with existing and newly-created English and multilingual prompts.
We find that Llama Instruct and Mistral models exhibit high degrees of language confusion.
We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT and preference tuning.
arXiv Detail & Related papers (2024-06-28T17:03:51Z)
- Paying More Attention to Source Context: Mitigating Unfaithful Translations from Large Language Model [28.288949710191158]
Large language models (LLMs) have showcased impressive multilingual machine translation ability.
Unlike encoder-decoder style models, decoder-only LLMs lack an explicit alignment between source and target contexts.
We propose to encourage LLMs to pay more attention to the source context from both source and target perspectives.
arXiv Detail & Related papers (2024-06-11T07:49:04Z)
- CSS: Contrastive Semantic Similarity for Uncertainty Quantification of LLMs [1.515687944002438]
We propose Contrastive Semantic Similarity, a module to obtain similarity features for measuring uncertainty for text pairs.
We conduct extensive experiments with three large language models (LLMs) on several benchmark question-answering datasets.
Results show that our proposed method performs better in estimating reliable responses of LLMs than comparable baselines.
arXiv Detail & Related papers (2024-06-05T11:35:44Z)
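A common way to operationalize similarity-based uncertainty is to sample several answers to the same question and treat low mutual semantic similarity as high uncertainty. The sketch below is that generic consistency proxy, not the paper's CSS module; `embed` is a placeholder for any sentence-embedding model.

```python
# Consistency-based uncertainty proxy (illustrative; not the paper's CSS module).
import numpy as np

def embed(texts):
    # Placeholder: plug in a real sentence encoder here (assumption).
    raise NotImplementedError("replace with a sentence-embedding model")

def semantic_uncertainty(sampled_answers):
    """Higher value = sampled answers disagree semantically = less reliable."""
    vecs = np.asarray(embed(sampled_answers), dtype=float)   # shape (n, d)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = vecs @ vecs.T                                      # pairwise cosine similarity
    n = len(sampled_answers)
    mean_offdiag = (sims.sum() - np.trace(sims)) / (n * (n - 1))
    return 1.0 - mean_offdiag
```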
- How Much are LLMs Contaminated? A Comprehensive Survey and the LLMSanitize Library [68.10605098856087]
With the rise of Large Language Models (LLMs) in recent years, new opportunities are emerging, but also new challenges, and contamination is quickly becoming a critical issue.
Business applications and fundraising in AI have reached a scale at which a few percentage points gained on popular question-answering benchmarks could translate into tens of millions of dollars.
It is becoming increasingly difficult, if not impossible, to keep track of the data that LLMs have seen, since closed-source models like GPT-4 and Claude-3 divulge no information about their training sets.
arXiv Detail & Related papers (2024-03-31T14:32:02Z)
- Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models [27.24738197172374]
Large language models have achieved remarkable performance on various code generation benchmarks.
There have been growing concerns regarding potential contamination of these benchmarks as they may be leaked into pretraining and finetuning data.
We show that there is substantial overlap between popular code generation benchmarks and open training corpora, and that models perform significantly better on the subset of the benchmarks where similar solutions were seen during training.
arXiv Detail & Related papers (2024-03-06T21:45:35Z)
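Surface-level overlap of this kind is usually quantified with n-gram matching between benchmark items and training documents; the sketch below illustrates that idea generically and is not the paper's exact measurement protocol.

```python
# Generic n-gram overlap check between a benchmark item and training documents.
def ngrams(text, n=13):
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(benchmark_item, training_docs, n=13):
    """Fraction of the item's n-grams that appear verbatim in any training doc."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    corpus_grams = set()
    for doc in training_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)
```

As the main paper argues, such text-overlap checks can miss deeper forms of contamination, such as the cross-lingual injection it studies.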
- LLMRefine: Pinpointing and Refining Large Language Models via Fine-Grained Actionable Feedback [65.84061725174269]
Recent large language models (LLMs) leverage human feedback to improve their generation quality.
We propose LLMRefine, an inference-time optimization method to refine LLMs' output.
We conduct experiments on three text generation tasks, including machine translation, long-form question answering (QA), and topical summarization.
LLMRefine consistently outperforms all baseline approaches, achieving improvements of up to 1.7 MetricX points on translation tasks, 8.1 ROUGE-L on ASQA, and 2.2 ROUGE-L on topical summarization.
arXiv Detail & Related papers (2023-11-15T19:52:11Z)
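An inference-time refinement loop of this kind reduces to: generate a draft, collect fine-grained feedback from a critic, revise, and repeat until no errors remain. The sketch below illustrates that loop; `generate`, `critique`, and `revise` are hypothetical stand-ins, not LLMRefine's actual interfaces.

```python
# Illustrative generate-critique-revise loop in the spirit of LLMRefine.
def refine(source, generate, critique, revise, max_steps=5):
    """generate(source) -> draft; critique(source, draft) -> list of error notes;
    revise(source, draft, errors) -> improved draft."""
    draft = generate(source)
    for _ in range(max_steps):
        errors = critique(source, draft)   # fine-grained, localized feedback
        if not errors:                     # stop once the critic finds nothing
            break
        draft = revise(source, draft, errors)
    return draft
```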
- CLEAN-EVAL: Clean Evaluation on Contaminated Large Language Models [12.367149496971408]
Clean-Eval mitigates the issue of data contamination and evaluates the models in a cleaner manner.
Clean-Eval employs an LLM to paraphrase and back-translate the contaminated data into a candidate set.
A semantic detector is then used to filter out low-quality generated samples.
The best candidate is finally selected from this set based on the BLEURT score.
arXiv Detail & Related papers (2023-11-15T17:50:30Z)
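The pipeline summarized above (paraphrase and back-translate into a candidate set, filter with a semantic detector, select the best candidate by BLEURT) can be sketched as follows; every callable is a placeholder for the reader's own models, not the authors' code.

```python
# Sketch of a Clean-Eval-style decontamination step for one benchmark item.
def clean_eval_item(item, paraphrase, back_translate, semantic_ok, bleurt_score):
    candidates = paraphrase(item) + back_translate(item)          # candidate set
    candidates = [c for c in candidates if semantic_ok(item, c)]  # drop low-quality ones
    if not candidates:
        return item                                               # fall back to the original
    return max(candidates, key=lambda c: bleurt_score(item, c))   # best by BLEURT
```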
- ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful for triggering hallucinations in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z)
- Reflection-Tuning: Data Recycling Improves LLM Instruction-Tuning [79.32236399694077]
Low-quality data in the training set are usually detrimental to instruction tuning.
We propose a novel method, termed "reflection-tuning".
This approach utilizes an oracle LLM to recycle the original training data by introspecting and enhancing the quality of instructions and responses in the data.
arXiv Detail & Related papers (2023-10-18T05:13:47Z)
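The recycling step can be pictured as an oracle LLM critiquing and rewriting each (instruction, response) pair. The sketch below is illustrative only; the `oracle` callable and the prompts are assumptions, not the paper's actual prompts.

```python
# Illustrative data-recycling loop in the spirit of reflection-tuning.
def recycle_dataset(pairs, oracle):
    """pairs: iterable of (instruction, response) tuples; returns improved pairs."""
    improved = []
    for instruction, response in pairs:
        better_instruction = oracle(
            "Critique and rewrite this instruction so it is clearer and more specific:\n"
            f"{instruction}"
        )
        better_response = oracle(
            f"Instruction:\n{better_instruction}\n\n"
            "Rewrite the following response so it fully and accurately answers it:\n"
            f"{response}"
        )
        improved.append((better_instruction, better_response))
    return improved
```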
- A New Benchmark and Reverse Validation Method for Passage-level Hallucination Detection [63.56136319976554]
Large Language Models (LLMs) generate hallucinations, which can cause significant damage when deployed for mission-critical tasks.
We propose a self-check approach based on reverse validation to detect factual errors automatically in a zero-resource fashion.
We empirically evaluate our method and existing zero-resource detection methods on two datasets.
arXiv Detail & Related papers (2023-10-10T10:14:59Z)
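A zero-resource self-check in this spirit can be sketched by re-deriving the key fact from the model's own passage and comparing it with a fresh answer; the `ask` callable and the prompts are illustrative assumptions, not the paper's exact reverse-validation procedure.

```python
# Generic passage-level self-check sketch (illustrative; not the paper's method).
def reverse_validate(ask, question, passage):
    # 1. Extract the claim the passage makes about the question.
    claim = ask("Based only on this passage, answer the question.\n"
                f"Passage: {passage}\nQuestion: {question}")
    # 2. Ask the model the same question again without the passage.
    fresh_answer = ask(f"Answer concisely: {question}")
    # 3. Let the model judge whether the two answers state the same fact.
    verdict = ask("Do these two answers state the same fact? Reply yes or no.\n"
                  f"A: {claim}\nB: {fresh_answer}")
    return verdict.strip().lower().startswith("yes")   # False suggests a hallucination
```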
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences.