Data Contamination Report from the 2024 CONDA Shared Task
- URL: http://arxiv.org/abs/2407.21530v2
- Date: Sun, 4 Aug 2024 05:53:25 GMT
- Title: Data Contamination Report from the 2024 CONDA Shared Task
- Authors: Oscar Sainz, Iker García-Ferrero, Alon Jacovi, Jon Ander Campos, Yanai Elazar, Eneko Agirre, Yoav Goldberg, Wei-Lin Chen, Jenny Chim, Leshem Choshen, Luca D'Amico-Wong, Melissa Dell, Run-Ze Fan, Shahriar Golchin, Yucheng Li, Pengfei Liu, Bhavish Pahwa, Ameya Prabhu, Suryansh Sharma, Emily Silcock, Kateryna Solonko, David Stap, Mihai Surdeanu, Yu-Min Tseng, Vishaal Udandarao, Zengzhi Wang, Ruijie Xu, Jinglin Yang
- Abstract summary: This first compilation paper is based on 566 reported entries over 91 contaminated sources from a total of 23 contributors.
The goal of the shared task and associated database is to assist the community in understanding the extent of the problem and to assist researchers in avoiding reporting evaluation results on known contaminated resources.
- Score: 78.50743680642405
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The 1st Workshop on Data Contamination (CONDA 2024) focuses on all relevant aspects of data contamination in natural language processing, where data contamination is understood as situations where evaluation data is included in pre-training corpora used to train large-scale models, compromising evaluation results. The workshop fostered a shared task to collect evidence on data contamination in currently available datasets and models. The goal of the shared task and associated database is to assist the community in understanding the extent of the problem and to assist researchers in avoiding reporting evaluation results on known contaminated resources. The shared task provides a structured, centralized public database for the collection of contamination evidence, open to contributions from the community via GitHub pull requests. This first compilation paper is based on 566 reported entries over 91 contaminated sources from a total of 23 contributors. The details of the individual contamination events are available in the platform. The platform continues to be online, open to contributions from the community.
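As an aside, contamination evidence of the kind collected here is often gathered with simple n-gram overlap checks between evaluation sets and pre-training corpora. The sketch below is illustrative only and is not the method of this paper; the function names, the choice of word-level n-grams, and the value of n are all assumptions.

```python
# Illustrative sketch (not from the paper): a naive word n-gram overlap check
# of the kind commonly used to flag benchmark contamination in plain text.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in a lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(eval_examples: list[str], corpus_text: str, n: int = 8) -> float:
    """Fraction of evaluation examples sharing at least one n-gram with the corpus."""
    corpus_ngrams = ngrams(corpus_text, n)
    flagged = sum(1 for ex in eval_examples if ngrams(ex, n) & corpus_ngrams)
    return flagged / len(eval_examples) if eval_examples else 0.0
```

Real pipelines typically tokenize more carefully and use longer n-grams (13-grams are a common choice) to limit false positives, but the core idea is the same.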
Related papers
- TextMine: Data, Evaluation Framework and Ontology-guided LLM Pipeline for Humanitarian Mine Action [4.990484801014005]
Humanitarian Mine Action (HMA) addresses the challenge of detecting and removing landmines from conflict regions. Much of the life-saving operational knowledge produced by HMA agencies is buried in unstructured reports. To address this issue, we propose TextMine: the first dataset, evaluation framework and ontology-guided large language model (LLM) pipeline.
arXiv Detail & Related papers (2025-09-18T15:55:19Z) - A Global Dataset of Location Data Integrity-Assessed Reforestation Efforts [40.17692290400862]
This study presents a dataset on global afforestation and reforestation efforts compiled from primary (meta-)information. Our dataset covers 1,289,068 planting sites from 45,628 projects spanning 33 years. Approximately 79% of the georeferenced planting sites monitored fail on at least 1 out of 10 LDIS indicators.
arXiv Detail & Related papers (2025-08-15T09:28:31Z) - Conformal Data Contamination Tests for Trading or Sharing of Data [28.020738753027043]
The amount of quality data in many machine learning tasks is limited to what is available locally to data owners. We propose a distribution-free, contamination-aware data-sharing framework that identifies external data agents whose data is most valuable for model personalization.
arXiv Detail & Related papers (2025-07-18T11:44:42Z) - Learning Dense Hand Contact Estimation from Imbalanced Data [51.54990464786128]
There are two major challenges for learning dense hand contact estimation. First, there exists a class imbalance issue in hand contact datasets, where the majority of samples are not in contact. Second, hand contact datasets contain a spatial imbalance issue, with most hand contact exhibited at the fingertips.
arXiv Detail & Related papers (2025-05-16T11:54:25Z) - Multi-Platform Aggregated Dataset of Online Communities (MADOC) [64.45797970830233]
MADOC aggregates and standardizes data from Bluesky, Koo, Reddit, and Voat (2012-2024), containing 18.9 million posts, 236 million comments, and 23.1 million unique users.
The dataset enables comparative studies of toxic behavior evolution across platforms through standardized interaction records and sentiment analysis.
arXiv Detail & Related papers (2025-01-22T14:02:11Z) - Mutual Information Multinomial Estimation [53.58005108981247]
Estimating mutual information (MI) is a fundamental yet challenging task in data science and machine learning.
Our main discovery is that a preliminary estimate of the data distribution can dramatically improve MI estimation.
Experiments on diverse tasks including non-Gaussian synthetic problems with known ground-truth and real-world applications demonstrate the advantages of our method.
arXiv Detail & Related papers (2024-08-18T06:27:30Z) - A Taxonomy for Data Contamination in Large Language Models [12.643103231497813]
A growing concern is data contamination, where evaluation datasets may be contained in the pretraining corpus.
Decontamination, the process of detecting and removing such data, is a potential solution.
How different types of contamination impact the performance of language models on downstream tasks is not fully understood.
arXiv Detail & Related papers (2024-07-11T17:50:34Z) - How Much are Large Language Models Contaminated? A Comprehensive Survey and the LLMSanitize Library [68.10605098856087]
Large Language Models (LLMs) are increasingly being used in business applications and AI fundraising.
LLMs' performance may no longer be reliable, as high scores may be at least partly due to prior exposure to the evaluation data.
We release an open-source Python library named LLMSanitize implementing major contamination detection algorithms.
arXiv Detail & Related papers (2024-03-31T14:32:02Z) - Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models [42.958880063727996]
CDD stands for Contamination Detection via output Distribution for LLMs.
To mitigate the impact of data contamination in evaluation, we also present TED: Trustworthy Evaluation via output Distribution.
arXiv Detail & Related papers (2024-02-24T23:54:41Z) - A Framework for Scalable Ambient Air Pollution Concentration Estimation [0.0]
Ambient air pollution remains a critical issue in the United Kingdom, where data on air pollution concentrations form the foundation for interventions aimed at improving air quality.
We introduce a data-driven supervised machine learning model framework designed to address temporal and spatial data gaps by filling missing measurements.
This approach provides a comprehensive dataset for England throughout 2018 at a 1 km × 1 km hourly resolution.
arXiv Detail & Related papers (2024-01-16T18:03:07Z) - NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark [19.875954121100005]
We argue that the classical evaluation on Natural Language Processing (NLP) tasks using annotated benchmarks is in trouble.
The worst kind of data contamination happens when a Large Language Model (LLM) is trained on the test split of a benchmark and then evaluated on the same benchmark.
This position paper defines different levels of data contamination and argues for a community effort.
arXiv Detail & Related papers (2023-10-27T09:48:29Z) - Federated Causal Discovery [74.37739054932733]
This paper develops a gradient-based learning framework named DAG-Shared Federated Causal Discovery (DS-FCD).
It can learn the causal graph without directly touching local data and naturally handle the data heterogeneity.
Extensive experiments on both synthetic and real-world datasets verify the efficacy of the proposed method.
arXiv Detail & Related papers (2021-12-07T08:04:12Z) - Combining Data-driven Supervision with Human-in-the-loop Feedback for Entity Resolution [47.90125404360125]
We build a model that identifies and consolidates data points that represent the same person.
In this case study, we discuss our human-in-the-loop enabled, data-centric solution to closing the training-production performance divergence.
arXiv Detail & Related papers (2021-11-20T02:22:12Z) - Measuring Data Collection Diligence for Community Healthcare [23.612133021992868]
Non-diligent data collection by community health workers (CHWs) is a significant challenge in developing countries.
In this work, we define and test a data collection diligence score.
Our framework has been validated on the ground using observations by the field monitors of our partner NGO in India.
arXiv Detail & Related papers (2020-11-05T16:45:03Z) - Trust and Transparency in Contact Tracing Applications [81.07729301514182]
The global outbreak of COVID-19 has led to efforts to manage and mitigate the continued spread of the disease.
One of these efforts includes the use of contact tracing to identify people who are at risk of developing the disease through exposure to an infected person.
There has been significant interest in the development and use of digital contact tracing solutions to supplement the work of human contact tracers.
The collection and use of sensitive personal details by these applications has led to a number of concerns among stakeholder groups with a vested interest in these solutions.
arXiv Detail & Related papers (2020-06-19T20:29:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.