Related papers: How Much are Large Language Models Contaminated? A Comprehensive Survey and the LLMSanitize Library

How Much are Large Language Models Contaminated? A Comprehensive Survey and the LLMSanitize Library

URL: http://arxiv.org/abs/2404.00699v3
Date: Tue, 20 Aug 2024 18:51:26 GMT
Title: How Much are Large Language Models Contaminated? A Comprehensive Survey and the LLMSanitize Library
Authors: Mathieu Ravaut, Bosheng Ding, Fangkai Jiao, Hailin Chen, Xingxuan Li, Ruochen Zhao, Chengwei Qin, Caiming Xiong, Shafiq Joty,
Abstract summary: Large Language Models (LLMs) are increasingly being used in business applications and fundraising in AI. LLMs' performance may not be reliable anymore, as the high performance may be at least partly due to their previous exposure to the data. We release an open-source Python library named LLMSanitize implementing major contamination detection algorithms.
Score: 68.10605098856087
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: With the rise of Large Language Models (LLMs) in recent years, abundant new opportunities are emerging, but also new challenges, among which contamination is quickly becoming critical. Business applications and fundraising in AI have reached a scale at which a few percentage points gained on popular question-answering benchmarks could translate into dozens of millions of dollars, placing high pressure on model integrity. At the same time, it is becoming harder and harder to keep track of the data that LLMs have seen; if not impossible with closed-source models like GPT-4 and Claude-3 not divulging any information on the training set. As a result, contamination becomes a major issue: LLMs' performance may not be reliable anymore, as the high performance may be at least partly due to their previous exposure to the data. This limitation jeopardizes the entire progress in the field of NLP, yet, there remains a lack of methods on how to efficiently detect contamination.In this paper, we survey all recent work on contamination detection with LLMs, and help the community track contamination levels of LLMs by releasing an open-source Python library named LLMSanitize implementing major contamination detection algorithms.

Related papers

Preference Leakage: A Contamination Problem in LLM-as-a-judge [69.96778498636071]
Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods. In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators.
arXiv Detail & Related papers (2025-02-03T17:13:03Z)
AntiLeak-Bench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge [68.39683427262335]
Existing studies fail to guarantee contamination-free evaluation as newly collected data may contain pre-existing knowledge. We propose AntiLeak-Bench, an automated anti-leakage benchmarking framework.
arXiv Detail & Related papers (2024-12-18T09:53:12Z)
Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination [18.586654412992168]
multimodal large language models (MLLMs) have demonstrated superior performance on various multimodal benchmarks. The issue of data contamination during training creates challenges in performance evaluation and comparison. We introduce a multimodal data contamination detection framework, MM-Detect, designed for MLLMs.
arXiv Detail & Related papers (2024-11-06T10:44:15Z)
Towards Data Contamination Detection for Modern Large Language Models: Limitations, Inconsistencies, and Oracle Challenges [3.0455427910850785]
We evaluate five contamination detection approaches with four state-of-the-art LLMs across eight challenging datasets. Our analysis reveals that current methods have non-trivial limitations in their assumptions and practical applications.
arXiv Detail & Related papers (2024-09-16T02:04:33Z)
Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses. Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives. The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z)
Evaluation and Improvement of Fault Detection for Large Language Models [30.760472387136954]
This paper investigates the effectiveness of existing fault detection methods for large language models (LLMs) We propose textbfMuCS, a prompt textbfMutation-based prediction textbfConfidence textbfSmoothing framework to boost the fault detection capability of existing methods.
arXiv Detail & Related papers (2024-04-14T07:06:12Z)
Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models [42.958880063727996]
CDD stands for Contamination Detection via output Distribution for LLMs. To mitigate the impact of data contamination in evaluation, we also present TED: Trustworthy Evaluation via output Distribution.
arXiv Detail & Related papers (2024-02-24T23:54:41Z)
KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models [53.84677081899392]
KIEval is a Knowledge-grounded Interactive Evaluation framework for large language models. It incorporates an LLM-powered "interactor" role for the first time to accomplish a dynamic contamination-resilient evaluation. Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization.
arXiv Detail & Related papers (2024-02-23T01:30:39Z)
Purifying Large Language Models by Ensembling a Small Language Model [39.57304668057076]
We propose a simple and easily implementable method for purifying LLMs from the negative effects caused by uncurated data. We empirically confirm the efficacy of ensembling LLMs with benign and small language models (SLMs)
arXiv Detail & Related papers (2024-02-19T14:00:39Z)
Small Models, Big Insights: Leveraging Slim Proxy Models To Decide When and What to Retrieve for LLMs [60.40396361115776]
This paper introduces a novel collaborative approach, namely SlimPLM, that detects missing knowledge in large language models (LLMs) with a slim proxy model. We employ a proxy model which has far fewer parameters, and take its answers as answers. Heuristic answers are then utilized to predict the knowledge required to answer the user question, as well as the known and unknown knowledge within the LLM.
arXiv Detail & Related papers (2024-02-19T11:11:08Z)
Task Contamination: Language Models May Not Be Few-Shot Anymore [9.696290050028237]
Large language models (LLMs) offer impressive performance in various zero-shot and few-shot tasks. However, their success in zero-shot and few-shot settings may be affected by task contamination. This paper investigates how zero-shot and few-shot performance of LLMs has changed chronologically over time.
arXiv Detail & Related papers (2023-12-26T21:17:46Z)
Rethinking Benchmark and Contamination for Language Models with Rephrased Samples [49.18977581962162]
Large language models are increasingly trained on all the data ever produced by humans. Many have raised concerns about the trustworthiness of public benchmarks due to potential contamination in pre-training or fine-tuning datasets.
arXiv Detail & Related papers (2023-11-08T17:35:20Z)
CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages [86.90220551111096]
Training datasets for large language models (LLMs) are often not fully disclosed. We present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages.
arXiv Detail & Related papers (2023-09-17T23:49:10Z)
LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits. Most LLMs struggle on SummEdits, with performance close to random chance. The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
On the Risk of Misinformation Pollution with Large Language Models [127.1107824751703]
We investigate the potential misuse of modern Large Language Models (LLMs) for generating credible-sounding misinformation. Our study reveals that LLMs can act as effective misinformation generators, leading to a significant degradation in the performance of Open-Domain Question Answering (ODQA) systems.
arXiv Detail & Related papers (2023-05-23T04:10:26Z)
Assessing Hidden Risks of LLMs: An Empirical Study on Robustness, Consistency, and Credibility [37.682136465784254]
We conduct over a million queries to the mainstream large language models (LLMs) including ChatGPT, LLaMA, and OPT. We find that ChatGPT is still capable to yield the correct answer even when the input is polluted at an extreme level. We propose a novel index associated with a dataset that roughly decides the feasibility of using such data for LLM-involved evaluation.
arXiv Detail & Related papers (2023-05-15T15:44:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.