Conformal Data Contamination Tests for Trading or Sharing of Data
- URL: http://arxiv.org/abs/2507.13835v1
- Date: Fri, 18 Jul 2025 11:44:42 GMT
- Title: Conformal Data Contamination Tests for Trading or Sharing of Data
- Authors: Martin V. Vejling, Shashi Raj Pandey, Christophe A. N. Biscio, Petar Popovski
- Abstract summary: The amount of quality data in many machine learning tasks is limited to what is available locally to data owners. We propose a distribution-free, contamination-aware data-sharing framework that identifies external data agents whose data is most valuable for model personalization.
- Score: 28.020738753027043
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The amount of quality data in many machine learning tasks is limited to what is available locally to data owners. The set of quality data can be expanded through trading or sharing with external data agents. However, data buyers need quality guarantees before purchasing, as external data may be contaminated or irrelevant to their specific learning task. Previous works primarily rely on distributional assumptions about data from different agents, relegating quality checks to post-hoc steps involving costly data valuation procedures. We propose a distribution-free, contamination-aware data-sharing framework that identifies external data agents whose data is most valuable for model personalization. To achieve this, we introduce novel two-sample testing procedures, grounded in rigorous theoretical foundations for conformal outlier detection, to determine whether an agent's data exceeds a contamination threshold. The proposed tests, termed conformal data contamination tests, remain valid under arbitrary contamination levels while enabling false discovery rate control via the Benjamini-Hochberg procedure. Empirical evaluations across diverse collaborative learning scenarios demonstrate the robustness and effectiveness of our approach. Overall, the conformal data contamination test distinguishes itself as a generic procedure for aggregating data with statistically rigorous quality guarantees.
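The recipe sketched in the abstract, per-point conformal p-values against the buyer's calibration data, an agent-level test against a contamination threshold, and Benjamini-Hochberg across agents, can be illustrated in a few lines. The sketch below is a simplified stand-in and not the authors' procedure: the binomial plug-in for the agent-level test, the orientation of the hypotheses, and the parameters alpha, eps, and q are all assumptions.

```python
import numpy as np
from scipy.stats import binom

def conformal_pvalues(cal_scores, test_scores):
    """Marginal conformal p-values: rank of each test score among the
    buyer's calibration scores (higher score = more outlying)."""
    cal = np.sort(np.asarray(cal_scores))
    n = cal.size
    n_ge = n - np.searchsorted(cal, test_scores, side="left")
    return (1.0 + n_ge) / (n + 1.0)

def agent_pvalue(cal_scores, agent_scores, alpha=0.1, eps=0.2):
    """Heuristic p-value for H0: the agent's contamination level is <= eps.
    Under H0 a point yields a conformal p-value <= alpha with probability
    at most (1 - eps) * alpha + eps, so the flag count is compared against
    that binomial bound. The paper's tests additionally account for the
    dependence induced by the shared calibration set, which this simple
    plug-in ignores; the paper may also orient the hypotheses differently."""
    p = conformal_pvalues(cal_scores, agent_scores)
    flagged = int(np.sum(p <= alpha))
    bound = (1.0 - eps) * alpha + eps
    return binom.sf(flagged - 1, p.size, bound)  # P(Bin(m, bound) >= flagged)

def benjamini_hochberg(pvals, q=0.1):
    """Indices of agents rejected at FDR level q."""
    p = np.asarray(pvals)
    order = np.argsort(p)
    passed = p[order] <= q * np.arange(1, p.size + 1) / p.size
    if not passed.any():
        return np.array([], dtype=int)
    return order[: np.nonzero(passed)[0].max() + 1]
```

A buyer would compute one agent-level p-value per candidate seller and let benjamini_hochberg decide which agents to flag as exceeding the contamination threshold.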
Related papers
- Unlocking Post-hoc Dataset Inference with Synthetic Data [11.886166976507711]
Training datasets are often scraped from the internet without respecting data owners' intellectual property rights. Dataset inference (DI) offers a potential remedy by identifying whether a suspect dataset was used in training. Existing DI methods require a private set, known to be absent from training, that closely matches the compromised dataset's distribution. In this work, we address this challenge by synthetically generating the required held-out set.
arXiv Detail & Related papers (2025-06-18T08:46:59Z)
- Private, Augmentation-Robust and Task-Agnostic Data Valuation Approach for Data Marketplace [56.78396861508909]
PriArTa is an approach for computing the distance between the distributions of the buyer's existing dataset and the seller's dataset.
PriArTa is communication-efficient, enabling the buyer to evaluate datasets without needing access to the entire dataset from each seller.
arXiv Detail & Related papers (2024-11-01T17:13:14Z)
- Data Distribution Valuation [56.71023681599737]
Existing data valuation methods define a value for a discrete dataset.
In many use cases, users are interested not only in the value of the dataset, but also in that of the distribution from which the dataset was sampled.
We propose a maximum mean discrepancy (MMD)-based valuation method which enables theoretically principled and actionable policies.
arXiv Detail & Related papers (2024-10-06T07:56:53Z)
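Squared MMD between two samples has a standard unbiased estimator; the sketch below uses an RBF kernel. The kernel choice, the bandwidth gamma, and how a valuation policy acts on the resulting distance are assumptions, since the summary does not specify them.

```python
import numpy as np

def mmd2_unbiased(X, Y, gamma=1.0):
    """Unbiased estimate of squared MMD between samples X (n x d) and
    Y (m x d), n and m >= 2, with RBF kernel exp(-gamma * ||a - b||^2)."""
    def rbf(A, B):
        return np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

    Kxx, Kyy, Kxy = rbf(X, X), rbf(Y, Y), rbf(X, Y)
    n, m = len(X), len(Y)
    # exclude diagonal terms so the within-sample averages are unbiased
    term_x = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_x + term_y - 2.0 * Kxy.mean()
```

A smaller estimate suggests the seller's sample is distributionally closer to the buyer's, which is the quantity a distribution-level valuation can act on.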
- Training on the Benchmark Is Not All You Need [52.01920740114261]
We propose a simple and effective data leakage detection method based on the contents of multiple-choice options. Our method works under gray-box conditions, without access to model training data or weights. We evaluate the degree of data leakage of 35 mainstream open-source LLMs on four benchmark datasets.
arXiv Detail & Related papers (2024-09-03T11:09:44Z)
- Stop Uploading Test Data in Plain Text: Practical Strategies for Mitigating Data Contamination by Evaluation Benchmarks [70.39633252935445]
Data contamination has become prevalent and challenging with the rise of models pretrained on large automatically-crawled corpora.
For closed models, the training data becomes a trade secret, and even for open models, it is not trivial to detect contamination.
We propose three strategies that can make a difference: (1) test data made public should be encrypted with a public key and licensed to disallow derivative distribution; (2) demand training exclusion controls from closed API holders, and protect your test data by refusing to evaluate without them; and (3) avoid data which appears with its solution on the internet, and release the web-page context of internet-derived data along with the data.
arXiv Detail & Related papers (2023-05-17T12:23:38Z)
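Strategy (1) is straightforward to operationalize. Below is a minimal sketch using the cryptography package's Fernet recipe; the paper's wording suggests public-key encryption, so this symmetric variant with a published key, along with the file names, is an assumption that captures the same goal of keeping plaintext out of crawled corpora.

```python
# pip install cryptography
from cryptography.fernet import Fernet

# The benchmark owner generates a key once and publishes it alongside
# the data: the goal is not secrecy, but keeping the plaintext test set
# out of automatically crawled pretraining corpora.
key = Fernet.generate_key()
fernet = Fernet(key)

with open("test_set.jsonl", "rb") as f:          # hypothetical file name
    plaintext = f.read()
with open("test_set.jsonl.enc", "wb") as f:
    f.write(fernet.encrypt(plaintext))

# Evaluators decrypt locally with the published key before running.
with open("test_set.jsonl.enc", "rb") as f:
    assert fernet.decrypt(f.read()) == plaintext
```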
- Auditing and Generating Synthetic Data with Controllable Trust Trade-offs [54.262044436203965]
We introduce a holistic auditing framework that comprehensively evaluates synthetic datasets and AI models.
It focuses on preventing bias and discrimination, ensuring fidelity to the source data, and assessing utility, robustness, and privacy preservation.
We demonstrate the framework's effectiveness by auditing various generative models across diverse use cases.
arXiv Detail & Related papers (2023-04-21T09:03:18Z)
- Fundamentals of Task-Agnostic Data Valuation [21.78555506720078]
We study valuing the data of a data owner/seller for a data seeker/buyer.
We focus on task-agnostic data valuation without any validation requirements.
arXiv Detail & Related papers (2022-08-25T22:07:07Z)
- Data-SUITE: Data-centric identification of in-distribution incongruous examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data.
We empirically validate Data-SUITE's performance and coverage guarantees.
arXiv Detail & Related papers (2022-02-17T18:58:31Z)
- Generating Higher-Fidelity Synthetic Datasets with Privacy Guarantees [34.01962235805095]
We consider the problem of enhancing user privacy in common machine learning development tasks, such as data annotation and inspection.
We propose employing Bayesian differential privacy as the means to achieve a rigorous theoretical guarantee while providing a better privacy-utility trade-off.
arXiv Detail & Related papers (2020-03-02T16:23:41Z)
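For orientation on the last entry, here is the textbook Gaussian mechanism for (epsilon, delta)-differential privacy. Note this is the standard mechanism, not the Bayesian differential privacy accounting the paper proposes, and the query and parameter values are illustrative.

```python
import numpy as np

def gaussian_mechanism(value, l2_sensitivity, epsilon, delta, rng=None):
    """Release `value` with (epsilon, delta)-DP via the classic Gaussian
    mechanism (analysis valid for 0 < epsilon < 1). This is the textbook
    mechanism, not the Bayesian DP variant proposed in the paper above."""
    rng = rng or np.random.default_rng()
    sigma = l2_sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return value + rng.normal(0.0, sigma, size=np.shape(value))

# Illustration: privatize the mean of rows lying in [0, 1]^d.
data = np.random.rand(1000, 8)
# changing one row moves the mean by at most sqrt(d)/n in L2 norm
sensitivity = np.sqrt(data.shape[1]) / len(data)
noisy_mean = gaussian_mechanism(data.mean(axis=0), sensitivity,
                                epsilon=0.5, delta=1e-5)
```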