The Economics of AI Training Data: A Research Agenda
- URL: http://arxiv.org/abs/2510.24990v1
- Date: Tue, 28 Oct 2025 21:37:35 GMT
- Title: The Economics of AI Training Data: A Research Agenda
- Authors: Hamidah Oderinwale, Anna Kazlauskas
- Abstract summary: Despite data's central role in AI production, it remains the least understood input. As AI labs exhaust public data and turn to proprietary sources, research across computer science, economics, law, and policy has fragmented. We establish data economics as a coherent field through three contributions.
- Score: 0.4174557458129457
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Despite data's central role in AI production, it remains the least understood input. As AI labs exhaust public data and turn to proprietary sources, with deals reaching hundreds of millions of dollars, research across computer science, economics, law, and policy has fragmented. We establish data economics as a coherent field through three contributions. First, we characterize data's distinctive properties -- nonrivalry, context dependence, and emergent rivalry through contamination -- and trace historical precedents for market formation in commodities such as oil and grain. Second, we present systematic documentation of AI training data deals from 2020 to 2025, revealing persistent market fragmentation, five distinct pricing mechanisms (from per-unit licensing to commissioning), and that most deals exclude original creators from compensation. Third, we propose a formal hierarchy of exchangeable data units (token, record, dataset, corpus, stream) and argue for data's explicit representation in production functions. Building on these foundations, we outline four open research problems foundational to data economics: measuring context-dependent value, balancing governance with privacy, estimating data's contribution to production, and designing mechanisms for heterogeneous, compositional goods.
Related papers
- A Sustainable AI Economy Needs Data Deals That Work for Generators [56.949279542190084]
We argue that the machine learning value chain is structurally unsustainable due to an economic data processing inequality. We analyze 73 public data deals and show that the majority of value accrues to aggregators. We propose an Equitable Data-Value Exchange Framework to enable a minimal market that benefits all participants.
arXiv Detail & Related papers (2026-01-15T01:05:48Z)
- OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value [74.80873109856563]
OpenDataArena (ODA) is a holistic and open platform designed to benchmark the intrinsic value of post-training data. ODA establishes a comprehensive ecosystem comprising four key pillars: (i) a unified training-evaluation pipeline that ensures fair, open comparisons across diverse models; (ii) a multi-dimensional scoring framework that profiles data quality along tens of distinct axes; and (iii) an interactive data lineage explorer to visualize dataset genealogy and dissect component sources.
arXiv Detail & Related papers (2025-12-16T03:33:24Z)
- An Instrumental Value for Data Production and its Application to Data Pricing [107.98697414652479]
This paper develops an approach for capturing the instrumental value of data production processes. We show how they connect to classic notions of information design and signals in information economics.
arXiv Detail & Related papers (2024-12-24T03:53:57Z)
- Wasserstein Markets for Differentially-Private Data [1.4266656344673316]
Data markets provide a means to enable wider access as well as determine the appropriate privacy-utility trade-off. Existing data market frameworks either require a trusted third party to perform expensive valuations or are unable to capture the nature of data value. This paper proposes a valuation mechanism based on the Wasserstein distance for differentially-private data, and corresponding procurement mechanisms.
arXiv Detail & Related papers (2024-12-03T17:40:26Z)
- A Novel Framework for Analyzing Structural Transformation in Data-Constrained Economies Using Bayesian Modeling and Machine Learning [0.0]
The shift from agrarian economies to more diversified industrial and service-based systems is a key driver of economic development.
In low- and middle-income countries (LMICs), data scarcity and unreliability hinder accurate assessments of this process.
This paper presents a novel statistical framework designed to address these challenges by integrating Bayesian hierarchical modeling, machine learning-based data imputation, and factor analysis.
arXiv Detail & Related papers (2024-09-25T08:39:41Z)
- Navigating the Data Trading Crossroads: An Interdisciplinary Survey [33.64953318642493]
Data has been increasingly recognized as a critical factor in the future economy.
However, constructing an efficient data trading market faces challenges such as privacy breaches, data monopolies, and misuse.
This paper aims to identify existing problems and research gaps, and to propose potential solutions.
arXiv Detail & Related papers (2024-07-16T08:07:16Z)
- Data Shapley in One Training Run [88.59484417202454]
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts. Existing approaches require re-training models on different data subsets, which is computationally intensive. This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest.
arXiv Detail & Related papers (2024-06-16T17:09:24Z)
- A Survey of Data Pricing for Data Marketplaces [77.3189288320768]
This paper attempts to comprehensively review the state-of-the-art on existing data pricing studies.
Our key contribution lies in a new taxonomy of data pricing studies that unifies different attributes determining data prices.
arXiv Detail & Related papers (2023-03-07T04:35:56Z)
- Data Sharing Markets [95.13209326119153]
We study a setup where each agent can be both buyer and seller of data.
We consider two cases: bilateral data exchange (trading data with data) and unilateral data exchange (trading data with money).
arXiv Detail & Related papers (2021-07-19T06:00:34Z)
- Representative & Fair Synthetic Data [68.8204255655161]
We present a framework to incorporate fairness constraints into the self-supervised learning process.
We generate a representative as well as fair version of the UCI Adult census data set.
We consider representative & fair synthetic data a promising future building block to teach algorithms not on historic worlds, but rather on the worlds that we strive to live in.
arXiv Detail & Related papers (2021-04-07T09:19:46Z)
- A Survey on Data Pricing: from Economics to Data Science [61.72030615854597]
We examine various motivations behind data pricing and understand the economics of data pricing.
We discuss both digital products and data products.
We consider a series of challenges and directions for future work.
arXiv Detail & Related papers (2020-09-09T19:31:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.