Reclaiming the Digital Commons: A Public Data Trust for Training Data
- URL: http://arxiv.org/abs/2303.09001v2
- Date: Sun, 21 May 2023 23:17:19 GMT
- Title: Reclaiming the Digital Commons: A Public Data Trust for Training Data
- Authors: Alan Chan, Herbie Bradley, Nitarshan Rajkumar
- Abstract summary: We propose that a public data trust assert control over training data for foundation models.
This trust should scrape the internet as a digital commons and license the data to commercial model developers for a percentage cut of revenues from deployment.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Democratization of AI means not only that people can freely use AI, but also
that people can collectively decide how AI is to be used. In particular,
collective decision-making power is required to redress the negative
externalities from the development of increasingly advanced AI systems,
including degradation of the digital commons and unemployment from automation.
The rapid pace of AI development and deployment currently leaves little room
for this power. Monopolized in the hands of private corporations, the
development of the most capable foundation models has proceeded largely without
public input. There is currently no implemented mechanism for ensuring that the
economic value generated by such models is redistributed to account for their
negative externalities. The citizens that have generated the data necessary to
train models do not have input on how their data are to be used. In this work,
we propose that a public data trust assert control over training data for
foundation models. In particular, this trust should scrape the internet as a
digital commons and license the data to commercial model developers for a
percentage cut of revenues from deployment. First, we argue in detail for the
existence of
such a trust. We also discuss feasibility and potential risks. Second, we
detail a number of ways for a data trust to incentivize model developers to use
training data only from the trust. We propose a mix of verification mechanisms,
potential regulatory action, and positive incentives. We conclude by
highlighting other potential benefits of our proposed data trust and connecting
our work to ongoing efforts in data and compute governance.
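The licensing arrangement in the abstract can be illustrated with a toy calculation. A minimal sketch, assuming a flat percentage cut; the 10% default rate and dollar figures here are hypothetical placeholders, as the paper does not specify concrete terms:

```python
# Hypothetical sketch of the proposed revenue-sharing license: the data
# trust licenses training data to a developer in exchange for a
# percentage cut of deployment revenues. The 10% rate is an assumption.

def trust_royalty(deployment_revenue: float, cut: float = 0.10) -> float:
    """Return the share of deployment revenue owed to the data trust."""
    if not 0.0 <= cut <= 1.0:
        raise ValueError("cut must be a fraction between 0 and 1")
    return deployment_revenue * cut

# A developer earning $5M from a deployed model would owe the trust
# roughly $500,000 under a hypothetical 10% cut.
owed = trust_royalty(5_000_000)
```

In practice the paper contemplates richer terms (verification mechanisms, regulatory backstops, positive incentives), so a flat cut is only the simplest instance of the scheme.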
Related papers
- Decentralized Intelligence Network (DIN)
Decentralized Intelligence Network (DIN) addresses the challenges of data sovereignty and AI utilization caused by the fragmentation and siloing of data across providers and institutions.
This comprehensive framework overcomes access barriers to scalable data sources.
It supports effective AI training, allowing participants to maintain control over their data, benefit financially, and contribute to a decentralized, scalable ecosystem.
arXiv Detail & Related papers (2024-07-02T17:40:06Z)
- An Economic Solution to Copyright Challenges of Generative AI
Generative artificial intelligence systems are trained to generate new pieces of text, images, videos, and other media.
There is growing concern that such systems may infringe on the copyright interests of training data contributors.
We propose a framework that compensates copyright owners proportionally to their contributions to the creation of AI-generated content.
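The compensation idea above can be sketched as a simple proportional split. This is a simplification under stated assumptions: the contribution scores, owner names, and royalty pool are hypothetical, and the paper's actual framework for measuring contributions is more involved:

```python
# Hypothetical sketch: split a royalty pool among copyright owners in
# proportion to already-measured contribution scores. How contributions
# are measured is the hard part and is not modeled here.

def proportional_payouts(contributions: dict[str, float],
                         pool: float) -> dict[str, float]:
    """Return each owner's share of the pool, proportional to their score."""
    total = sum(contributions.values())
    if total <= 0:
        raise ValueError("total contribution must be positive")
    return {owner: pool * score / total
            for owner, score in contributions.items()}

# With hypothetical scores 3:1, a 100-unit pool splits 75 / 25.
payouts = proportional_payouts({"alice": 3.0, "bob": 1.0}, pool=100.0)
```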
arXiv Detail & Related papers (2024-04-22T08:10:38Z)
- Trustless Audits without Revealing Data or Models
We show that it is possible to allow model providers to keep their model weights (but not architecture) and data secret while allowing other parties to trustlessly audit model and data properties.
We do this by designing a protocol called ZkAudit in which model providers publish cryptographic commitments of datasets and model weights.
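The commitment step can be illustrated with a plain hash commitment. This is a deliberate simplification: ZkAudit builds zero-knowledge proofs over such commitments, which this sketch does not implement; the variable names are illustrative only:

```python
# Simplified commit/reveal over SHA-256: publish H(nonce || data) now,
# reveal (nonce, data) later so anyone can check the commitment.
# ZkAudit's actual protocol proves properties of the committed data
# in zero knowledge, without ever revealing it.
import hashlib
import os

def commit(data: bytes) -> tuple[bytes, bytes]:
    """Return (digest, nonce); publish digest, keep nonce and data secret."""
    nonce = os.urandom(32)
    digest = hashlib.sha256(nonce + data).digest()
    return digest, nonce

def verify(digest: bytes, nonce: bytes, data: bytes) -> bool:
    """Check a revealed (nonce, data) pair against a published digest."""
    return hashlib.sha256(nonce + data).digest() == digest

weights = b"model-weights-bytes"  # stand-in for serialized weights
digest, nonce = commit(weights)
assert verify(digest, nonce, weights)
assert not verify(digest, nonce, b"tampered-weights")
```

The random nonce hides the data (so identical datasets do not produce recognizable digests), while the hash binds the committer to exactly one dataset.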
arXiv Detail & Related papers (2024-04-06T04:43:06Z)
- Computing Power and the Governance of Artificial Intelligence
Governments and companies have started to leverage compute as a means to govern AI.
Compute-based policies and technologies have the potential to assist in these areas, but there is significant variation in their readiness for implementation.
Naive or poorly scoped approaches to compute governance carry significant risks in areas like privacy, economic impacts, and centralization of power.
arXiv Detail & Related papers (2024-02-13T21:10:21Z)
- Mechanisms that Incentivize Data Sharing in Federated Learning
We show how a naive scheme leads to catastrophic levels of free-riding where the benefits of data sharing are completely eroded.
We then introduce accuracy-shaping mechanisms to maximize the amount of data generated by each agent.
arXiv Detail & Related papers (2022-07-10T22:36:52Z)
- APPFLChain: A Privacy Protection Distributed Artificial-Intelligence Architecture Based on Federated Learning and Consortium Blockchain
We propose a new system architecture called APPFLChain.
It is an integrated architecture of a Hyperledger Fabric-based blockchain and a federated-learning paradigm.
Our new system can maintain a high degree of security and privacy, as users do not need to share sensitive personal information with the server.
arXiv Detail & Related papers (2022-06-26T05:30:07Z)
- Representative & Fair Synthetic Data
We present a framework to incorporate fairness constraints into the self-supervised learning process.
We generate a representative as well as fair version of the UCI Adult census data set.
We consider representative & fair synthetic data a promising future building block to teach algorithms not on historic worlds, but rather on the worlds that we strive to live in.
arXiv Detail & Related papers (2021-04-07T09:19:46Z)
- Decentralized Federated Learning Preserves Model and Data Privacy
We propose a fully decentralized approach that allows knowledge to be shared between trained models.
Students are trained on the output of their teachers via synthetically generated input data.
The results show that an untrained student model, trained on the teacher's output, reaches F1-scores comparable to the teacher's.
arXiv Detail & Related papers (2021-02-01T14:38:54Z)
- Trustworthy AI
Brittleness to minor adversarial changes in the input data, limited ability to explain decisions, and bias inherited from training data are some of the most prominent limitations.
We propose the tutorial on Trustworthy AI to address six critical issues in enhancing user and public trust in AI systems.
arXiv Detail & Related papers (2020-11-02T20:04:18Z)
- A Distributed Trust Framework for Privacy-Preserving Machine Learning
This paper outlines a distributed infrastructure which is used to facilitate peer-to-peer trust between distributed agents.
We detail a proof of concept using Hyperledger Aries, Decentralised Identifiers (DIDs), and Verifiable Credentials (VCs).
arXiv Detail & Related papers (2020-06-03T18:06:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.