Related papers: A Cramér-von Mises Approach to Incentivizing Truthful Data Sharing

URL: http://arxiv.org/abs/2506.07272v1
Date: Sun, 08 Jun 2025 20:14:48 GMT
Title: A Cramér-von Mises Approach to Incentivizing Truthful Data Sharing
Authors: Alex Clinton, Thomas Zeng, Yiding Chen, Xiaojin Zhu, Kirthevasan Kandasamy
Abstract summary: We develop reward mechanisms based on a novel, two-sample test inspired by the Cramér-von Mises statistic. Our methods strictly incentivize agents to submit more genuine data, while disincentivizing data fabrication and other types of untruthful reporting.
Score: 10.731682970668142
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Modern data marketplaces and data sharing consortia increasingly rely on incentive mechanisms to encourage agents to contribute data. However, schemes that reward agents based on the quantity of submitted data are vulnerable to manipulation, as agents may submit fabricated or low-quality data to inflate their rewards. Prior work has proposed comparing each agent's data against others' to promote honesty: when others contribute genuine data, the best way to minimize discrepancy is to do the same. Yet prior implementations of this idea rely on very strong assumptions about the data distribution (e.g., Gaussian), limiting their applicability. In this work, we develop reward mechanisms based on a novel, two-sample test inspired by the Cramér-von Mises statistic. Our methods strictly incentivize agents to submit more genuine data, while disincentivizing data fabrication and other types of untruthful reporting. We establish that truthful reporting constitutes a (possibly approximate) Nash equilibrium in both Bayesian and prior-agnostic settings. We theoretically instantiate our method in three canonical data sharing problems and show that it relaxes key assumptions made by prior work. Empirically, we demonstrate that our mechanism incentivizes truthful data sharing via simulations and on real-world language and image data.
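
Illustration: the abstract's core idea (score an agent by the discrepancy between its submission and everyone else's data) can be sketched with the classical two-sample Cramér-von Mises statistic. This is not the paper's novel test or its reward rule; it is the textbook statistic on one-dimensional data, and the function names, sample sizes, and the "fabricated" distribution below are assumptions chosen for illustration only.

```python
from bisect import bisect_right
import random


def ecdf(sample):
    """Return the empirical CDF of a 1-D sample as a function t -> P(X <= t)."""
    s = sorted(sample)
    n = len(s)
    return lambda t: bisect_right(s, t) / n


def cvm_two_sample(x, y):
    """Classical two-sample Cramér-von Mises statistic.

    T = n*m / (n+m)^2 * sum over pooled points z of (F_x(z) - G_y(z))^2,
    where F_x and G_y are the empirical CDFs of x and y.
    Larger values mean the two samples look less alike.
    """
    n, m = len(x), len(y)
    F, G = ecdf(x), ecdf(y)
    pooled = list(x) + list(y)
    total = sum((F(z) - G(z)) ** 2 for z in pooled)
    return n * m / (n + m) ** 2 * total


# Hypothetical setup: other agents report genuine N(0, 1) draws.
rng = random.Random(0)
others = [rng.gauss(0.0, 1.0) for _ in range(200)]

# One agent reports genuine draws, another fabricates data from N(3, 1).
genuine = [rng.gauss(0.0, 1.0) for _ in range(200)]
fabricated = [rng.gauss(3.0, 1.0) for _ in range(200)]

honest_score = cvm_two_sample(genuine, others)
fabricated_score = cvm_two_sample(fabricated, others)
print(f"discrepancy (genuine):    {honest_score:.3f}")
print(f"discrepancy (fabricated): {fabricated_score:.3f}")
```

In this toy setup the fabricated submission yields a much larger discrepancy than the genuine one, which is the intuition behind rewarding agents for low discrepancy against others' data. How the paper turns its test into a reward with equilibrium guarantees is not shown here.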

Related papers

Scaling laws for learning with real and surrogate data [12.617392961074096]
We study a weighted empirical risk minimization (ERM) approach for integrating surrogate data into training. (i) Integrating surrogate data can significantly reduce the test error on the original distribution. (ii) In order to reap the benefit of surrogate data, it is crucial to use optimally weighted ERM.
arXiv Detail & Related papers (2024-02-06T20:30:19Z)
Membership Inference Attacks against Synthetic Data through Overfitting Detection [84.02632160692995]
We argue for a realistic MIA setting that assumes the attacker has some knowledge of the underlying data distribution. We propose DOMIAS, a density-based MIA model that aims to infer membership by targeting local overfitting of the generative model.
arXiv Detail & Related papers (2023-02-24T11:27:39Z)
On Comparing Fair Classifiers under Data Bias [42.43344286660331]
We study the effect of varying data biases on the accuracy and fairness of fair classifiers. Our experiments show how to integrate a measure of data bias risk in the existing fairness dashboards for real-world deployments.
arXiv Detail & Related papers (2023-02-12T13:04:46Z)
Mechanisms that Incentivize Data Sharing in Federated Learning [90.74337749137432]
We show how a naive scheme leads to catastrophic levels of free-riding where the benefits of data sharing are completely eroded. We then introduce accuracy shaping based mechanisms to maximize the amount of data generated by each agent.
arXiv Detail & Related papers (2022-07-10T22:36:52Z)
How to Leverage Unlabeled Data in Offline Reinforcement Learning [125.72601809192365]
Offline reinforcement learning (RL) can learn control policies from static datasets but, like standard RL methods, it requires reward annotations for every transition. One natural solution is to learn a reward function from the labeled data and use it to label the unlabeled data. We find that, perhaps surprisingly, a much simpler method that simply applies zero rewards to unlabeled data leads to effective data sharing.
arXiv Detail & Related papers (2022-02-03T18:04:54Z)
Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions. In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data. We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples.
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
Incentivizing Collaboration in Machine Learning via Synthetic Data Rewards [26.850070556844628]
This paper presents a novel collaborative generative modeling (CGM) framework that incentivizes collaboration among self-interested parties to contribute data. Distributing synthetic data as rewards offers task- and model-agnostic benefits for downstream learning tasks.
arXiv Detail & Related papers (2021-12-17T05:15:30Z)
Data Sharing Markets [95.13209326119153]
We study a setup where each agent can be both buyer and seller of data. We consider two cases: bilateral data exchange (trading data with data) and unilateral data exchange (trading data with money).
arXiv Detail & Related papers (2021-07-19T06:00:34Z)
Representative & Fair Synthetic Data [68.8204255655161]
We present a framework to incorporate fairness constraints into the self-supervised learning process. We generate a representative as well as fair version of the UCI Adult census data set. We consider representative & fair synthetic data a promising future building block to teach algorithms not on historic worlds, but rather on the worlds that we strive to live in.
arXiv Detail & Related papers (2021-04-07T09:19:46Z)
Fair Densities via Boosting the Sufficient Statistics of Exponential Families [72.34223801798422]
We introduce a boosting algorithm to pre-process data for fairness. Our approach shifts towards better data fitting while still ensuring a minimal fairness guarantee. Empirical results are presented to show the quality of the results on real-world data.
arXiv Detail & Related papers (2020-12-01T00:49:17Z)
ASCII: ASsisted Classification with Ignorance Interchange [17.413989127493622]
We propose a method named ASCII for an agent to improve its classification performance through assistance from other agents. The main idea is to iteratively interchange an ignorance value between 0 and 1 for each collated sample among agents. The method is naturally suitable for privacy-aware, transmission-economical, and decentralized learning scenarios.
arXiv Detail & Related papers (2020-10-21T03:57:36Z)
