RN-F: A Novel Approach for Mitigating Contaminated Data in Large Language Models
- URL: http://arxiv.org/abs/2505.13249v1
- Date: Mon, 19 May 2025 15:32:49 GMT
- Title: RN-F: A Novel Approach for Mitigating Contaminated Data in Large Language Models
- Authors: Le Vu Anh, Dinh Duc Nha Nguyen, Phi Long Nguyen
- Abstract summary: Residual-Noise Fingerprinting (RN-F) is a novel framework for detecting contaminated data in Large Language Models (LLMs). RN-F is a single-pass, gradient-free detection method that leverages residual signal patterns without introducing additional floating-point operations. We show that RN-F consistently outperforms existing state-of-the-art methods, achieving performance improvements of up to 10.5% in contamination detection metrics.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have become foundational in modern artificial intelligence, powering a wide range of applications from code generation and virtual assistants to scientific research and enterprise automation. However, concerns about data contamination--where test data overlaps with training data--have raised serious questions about the reliability of these applications. Despite awareness of this issue, existing methods fall short in effectively identifying or mitigating contamination. In this paper, we propose Residual-Noise Fingerprinting (RN-F), a novel framework for detecting contaminated data in LLMs. RN-F is a single-pass, gradient-free detection method that leverages residual signal patterns without introducing additional floating-point operations. Our approach is lightweight, model-agnostic, and efficient. We evaluate RN-F on multiple LLMs across various contaminated datasets and show that it consistently outperforms existing state-of-the-art methods, achieving performance improvements of up to 10.5% in contamination detection metrics.
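The abstract describes RN-F only at a high level. As a concrete illustration, here is a minimal sketch of how a single-pass, gradient-free contamination score based on residual signal patterns might be computed from per-token losses; the function name, the residual definition, and the scoring rule are illustrative assumptions, not the authors' implementation.
```python
# Hypothetical sketch of a residual-noise-style contamination score for a
# Hugging Face causal LM. Assumption (not from the paper): memorized text
# yields uniformly low per-token losses, so small, low-variance residuals
# are treated as suspicious.
import torch
import torch.nn.functional as F


@torch.no_grad()  # single pass, gradient-free
def residual_noise_score(model, tokenizer, text: str) -> float:
    """Return a contamination score; higher means more likely memorized."""
    enc = tokenizer(text, return_tensors="pt")
    out = model(**enc)
    # Per-token cross-entropy from one forward pass (no backward pass).
    logits = out.logits[:, :-1, :]
    targets = enc["input_ids"][:, 1:]
    token_losses = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    )
    # "Residual" signal: deviation of each token loss from the sample mean.
    residuals = token_losses - token_losses.mean()
    # Invert so that flat, low-loss profiles score high (more suspicious).
    return float(1.0 / (residuals.std() + token_losses.mean() + 1e-8))
```
In practice one would calibrate a decision threshold on known-clean text; the sketch above only orders samples by suspicion.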
Related papers
- A Survey on Data Contamination for Large Language Models
Large Language Models (LLMs) have demonstrated significant progress in various areas, such as text generation and code synthesis. The reliability of performance evaluation has come under scrutiny due to data contamination.
arXiv Detail & Related papers (2025-02-20T10:23:27Z)
- Towards Data Contamination Detection for Modern Large Language Models: Limitations, Inconsistencies, and Oracle Challenges
We evaluate five contamination detection approaches with four state-of-the-art LLMs across eight challenging datasets. Our analysis reveals that current methods have non-trivial limitations in their assumptions and practical applications.
arXiv Detail & Related papers (2024-09-16T02:04:33Z)
- Anomaly Detection of Tabular Data Using LLMs
We show that pre-trained large language models (LLMs) are zero-shot batch-level anomaly detectors.
We propose an end-to-end fine-tuning strategy to bring out the potential of LLMs in detecting real anomalies.
arXiv Detail & Related papers (2024-06-24T04:17:03Z)
- A Comprehensive Library for Benchmarking Multi-class Visual Anomaly Detection
This paper proposes ADer, a comprehensive visual anomaly detection benchmark built as a modular framework for evaluating new methods. The benchmark includes multiple datasets from industrial and medical domains, implementing fifteen state-of-the-art methods and nine comprehensive metrics. We objectively reveal the strengths and weaknesses of different methods and provide insights into the challenges and future directions of multi-class visual anomaly detection.
arXiv Detail & Related papers (2024-06-05T13:40:07Z)
- A Comprehensive Survey of Contamination Detection Methods in Large Language Models
The rise of Large Language Models (LLMs) in recent years has brought abundant new opportunities, but also new challenges. Reported LLM performance may no longer be reliable, as high scores may be at least partly due to prior exposure to the evaluation data. This limitation jeopardizes real capability improvement in NLP, yet methods for efficiently detecting contamination remain scarce.
arXiv Detail & Related papers (2024-03-31T14:32:02Z)
- Federated Learning with Anomaly Detection via Gradient and Reconstruction Analysis
We introduce a novel framework that synergizes gradient-based analysis with autoencoder-driven data reconstruction to detect and mitigate poisoned data with high precision (a minimal reconstruction-error sketch appears after this list).
Our method outperforms existing solutions by 15% in anomaly detection accuracy while maintaining a minimal false positive rate.
Our work paves the way for future advancements in distributed learning security.
arXiv Detail & Related papers (2024-03-15T03:54:45Z)
- Task-Distributionally Robust Data-Free Meta-Learning
Data-Free Meta-Learning (DFML) aims to efficiently learn new tasks by leveraging multiple pre-trained models without requiring their original training data.
For the first time, we reveal two major challenges hindering their practical deployment: Task-Distribution Shift (TDS) and Task-Distribution Corruption (TDC).
arXiv Detail & Related papers (2023-11-23T15:46:54Z)
- ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful for triggering hallucinations in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z)
- DOCTOR: A Multi-Disease Detection Continual Learning Framework Based on Wearable Medical Sensors
We propose DOCTOR, a multi-disease detection continual learning framework based on wearable medical sensors (WMSs).
It employs a multi-headed deep neural network (DNN) and a replay-style CL algorithm.
It achieves 1.43 times better average test accuracy, 1.25 times better F1-score, and 0.41 higher backward transfer than the naive fine-tuning framework.
arXiv Detail & Related papers (2023-05-09T19:33:17Z)
- SUOD: Accelerating Large-Scale Unsupervised Heterogeneous Outlier Detection
Outlier detection (OD) is a key machine learning (ML) task for identifying abnormal objects from general samples.
We propose a modular acceleration system, called SUOD, to address the scalability of large-scale heterogeneous OD.
arXiv Detail & Related papers (2020-03-11T00:22:50Z)
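The federated-learning entry above describes its detector in only one sentence; as promised there, the following minimal sketch illustrates the autoencoder reconstruction-error half of such a pipeline. The architecture, the threshold rule, and all names are illustrative assumptions, not that paper's implementation.
```python
# Illustrative sketch: flag potentially poisoned samples by autoencoder
# reconstruction error (the gradient-analysis half is omitted here).
import torch
import torch.nn as nn


class TinyAutoencoder(nn.Module):
    """A deliberately small autoencoder for tabular feature vectors."""

    def __init__(self, dim: int, hidden: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))


def flag_suspicious(model: TinyAutoencoder, batch: torch.Tensor, k: float = 3.0) -> torch.Tensor:
    """Mark samples whose reconstruction error is > k std devs above the mean."""
    with torch.no_grad():
        errors = ((model(batch) - batch) ** 2).mean(dim=1)
    threshold = errors.mean() + k * errors.std()
    return errors > threshold  # boolean mask of suspected poisoned samples
```
A model trained only on trusted data reconstructs clean samples well, so poisoned inputs tend to surface as outliers in the error distribution; k trades recall against the false positive rate.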