Related papers: Unveiling the Spectrum of Data Contamination in Language Models: A Survey from Detection to Remediation

URL: http://arxiv.org/abs/2406.14644v1
Date: Thu, 20 Jun 2024 18:07:26 GMT
Title: Unveiling the Spectrum of Data Contamination in Language Models: A Survey from Detection to Remediation
Authors: Chunyuan Deng, Yilun Zhao, Yuzhao Heng, Yitong Li, Jiannan Cao, Xiangru Tang, Arman Cohan
Abstract summary: The issue of training corpus overlap with evaluation benchmarks--referred to as contamination--has been the focus of significant recent research. This body of work aims to identify contamination, understand its impacts, and explore mitigation strategies. This survey serves as a succinct overview of the most recent advancements in data contamination research.
Score: 28.997127566200753
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Data contamination has garnered increased attention in the era of large language models (LLMs) due to the reliance on extensive internet-derived training corpora. The issue of training corpus overlap with evaluation benchmarks--referred to as contamination--has been the focus of significant recent research. This body of work aims to identify contamination, understand its impacts, and explore mitigation strategies from diverse perspectives. However, comprehensive studies that provide a clear pathway from foundational concepts to advanced insights are lacking in this nascent field. Therefore, we present a comprehensive survey in the field of data contamination, laying out the key issues, methodologies, and findings to date, and highlighting areas in need of further research and development. In particular, we begin by examining the effects of data contamination across various stages and forms. We then provide a detailed analysis of current contamination detection methods, categorizing them to highlight their focus, assumptions, strengths, and limitations. We also discuss mitigation strategies, offering a clear guide for future research. This survey serves as a succinct overview of the most recent advancements in data contamination research, providing a straightforward guide for the benefit of future research endeavors.

Related papers

Survey on AI-Generated Media Detection: From Non-MLLM to MLLM [51.91311158085973]
Methods for detecting AI-generated media have evolved rapidly. General-purpose detectors based on MLLMs integrate authenticity verification, explainability, and localization capabilities. Ethical and security considerations have emerged as critical global concerns.
arXiv Detail & Related papers (2025-02-07T12:18:20Z)
Model Inversion Attacks: A Survey of Approaches and Countermeasures [59.986922963781]
Model inversion attacks (MIAs), a recent type of privacy attack, aim to extract sensitive features of the private data used for training. Despite their significance, there is a lack of systematic studies that provide a comprehensive overview of and deeper insights into MIAs. This survey aims to summarize up-to-date MIA methods in both attacks and defenses.
arXiv Detail & Related papers (2024-11-15T08:09:28Z)
Cross-Target Stance Detection: A Survey of Techniques, Datasets, and Challenges [7.242609314791262]
Cross-target stance detection is the task of determining the viewpoint expressed in a text towards a given target. With the increasing need to analyze and mine viewpoints and opinions online, the task has recently seen a significant surge in interest. This review paper examines the advancements in cross-target stance detection over the last decade.
arXiv Detail & Related papers (2024-09-20T15:49:14Z)
Video Anomaly Detection in 10 Years: A Survey and Outlook [10.143205531474907]
Video anomaly detection (VAD) holds immense importance across diverse domains such as surveillance, healthcare, and environmental monitoring. This survey explores deep learning-based VAD, expanding beyond traditional supervised training paradigms to encompass emerging weakly supervised, self-supervised, and unsupervised approaches.
arXiv Detail & Related papers (2024-05-29T17:56:31Z)
Few-Shot Object Detection: Research Advances and Challenges [15.916463121997843]
Few-shot object detection (FSOD) combines few-shot learning and object detection techniques to rapidly adapt to novel objects with limited annotated samples. This paper presents a comprehensive survey to review the significant advancements in the field of FSOD in recent years.
arXiv Detail & Related papers (2024-04-07T03:37:29Z)
Investigating Data Contamination for Pre-training Language Models [46.335755305642564]
We explore the impact of data contamination at the pre-training stage by pre-training a series of GPT-2 models. We highlight the effect of both text contamination (i.e., input text of the evaluation samples) and ground-truth contamination (i.e., the prompts asked on the input and the desired outputs) from evaluation data.
arXiv Detail & Related papers (2024-01-11T17:24:49Z)
A Comprehensive Survey of Forgetting in Deep Learning Beyond Continual Learning [58.107474025048866]
Forgetting refers to the loss or deterioration of previously acquired knowledge. Forgetting is a prevalent phenomenon observed in various other research domains within deep learning.
arXiv Detail & Related papers (2023-07-16T16:27:58Z)
A Diachronic Analysis of Paradigm Shifts in NLP Research: When, How, and Why? [84.46288849132634]
We propose a systematic framework for analyzing the evolution of research topics in a scientific field using causal discovery and inference techniques. We define three variables to encompass diverse facets of the evolution of research topics within NLP. We utilize a causal discovery algorithm to unveil the causal connections among these variables using observational data.
arXiv Detail & Related papers (2023-05-22T11:08:00Z)
Recent Few-Shot Object Detection Algorithms: A Survey with Performance Comparison [54.357707168883024]
Few-Shot Object Detection (FSOD) mimics the human ability of learning to learn. FSOD intelligently transfers the learned generic object knowledge from the common heavy-tailed object classes to the novel long-tailed object classes. We give an overview of FSOD, including the problem definition, common datasets, and evaluation protocols.
arXiv Detail & Related papers (2022-03-27T04:11:28Z)
A Comparative Review of Recent Few-Shot Object Detection Algorithms [0.0]
Few-shot object detection, which learns to adapt to novel classes from only a few labeled examples, is an important and long-standing problem. Recent studies have explored how to use implicit cues in extra datasets without target-domain supervision to help few-shot detectors refine robust task notions.
arXiv Detail & Related papers (2021-10-30T07:57:11Z)
A Unified Survey on Anomaly, Novelty, Open-Set, and Out-of-Distribution Detection: Solutions and Future Challenges [28.104112546546936]
Machine learning models often encounter samples that diverge from the training distribution. Despite having similar and shared concepts, out-of-distribution, open-set, and anomaly detection have been investigated independently. This survey aims to provide a cross-domain and comprehensive review of numerous eminent works in the respective areas.
arXiv Detail & Related papers (2021-10-26T22:05:31Z)
Anomalous Example Detection in Deep Learning: A Survey [98.2295889723002]
This survey tries to provide a structured and comprehensive overview of the research on anomaly detection for Deep Learning applications. We provide a taxonomy for existing techniques based on their underlying assumptions and adopted approaches. We highlight the unsolved research challenges while applying anomaly detection techniques in DL systems and present some high-impact future research directions.
arXiv Detail & Related papers (2020-03-16T02:47:23Z)
Survey of Network Intrusion Detection Methods from the Perspective of the Knowledge Discovery in Databases Process [63.75363908696257]
We review the methods that have been applied to network data with the purpose of developing an intrusion detector. We discuss the techniques used for the capture, preparation, and transformation of the data, as well as the data mining and evaluation methods. As a result of this literature review, we investigate some open issues that will need to be considered for further research in the area of network security.
arXiv Detail & Related papers (2020-01-27T11:21:05Z)

This list is automatically generated from the titles and abstracts of the papers on this site.