Position: The Most Expensive Part of an LLM should be its Training Data
- URL: http://arxiv.org/abs/2504.12427v1
- Date: Wed, 16 Apr 2025 18:56:14 GMT
- Title: Position: The Most Expensive Part of an LLM should be its Training Data
- Authors: Nikhil Kandpal, Colin Raffel
- Abstract summary: Training a Large Language Model (LLM) is an increasingly expensive endeavor due to growing computational, hardware, energy, and engineering demands. Yet, an often-overlooked (and seldom paid) expense is the human labor behind these models' training data. This position paper aims to assign a monetary value to this labor and argues that the most expensive part of producing an LLM should be the compensation provided to training data producers.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training a state-of-the-art Large Language Model (LLM) is an increasingly expensive endeavor due to growing computational, hardware, energy, and engineering demands. Yet, an often-overlooked (and seldom paid) expense is the human labor behind these models' training data. Every LLM is built on an unfathomable amount of human effort: trillions of carefully written words sourced from books, academic papers, codebases, social media, and more. This position paper aims to assign a monetary value to this labor and argues that the most expensive part of producing an LLM should be the compensation provided to training data producers for their work. To support this position, we study 64 LLMs released between 2016 and 2024, estimating what it would cost to pay people to produce their training datasets from scratch. Even under highly conservative estimates of wage rates, the costs of these models' training datasets are 10-1000 times larger than the costs to train the models themselves, representing a significant financial liability for LLM providers. In the face of the massive gap between the value of training data and the lack of compensation for its creation, we highlight and discuss research directions that could enable fairer practices in the future.
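The paper's central comparison is a back-of-envelope calculation: how much it would cost to pay people to produce a model's training corpus versus how much the training run itself costs. A minimal sketch of that kind of calculation is below; the token count, token-to-word ratio, writing rate, wage, and compute cost are illustrative placeholder assumptions, not the values used in the paper's 64-model study.

```python
# Illustrative back-of-envelope comparison of data-labor cost vs. training-compute cost.
# All parameter values are placeholder assumptions, not figures from the paper.

def data_labor_cost(tokens: float,
                    words_per_token: float = 0.75,   # assumed token-to-word ratio
                    words_per_hour: float = 3000.0,  # ~50 wpm: values text at mere typing speed
                    hourly_wage: float = 10.0) -> float:
    """Estimated USD cost to pay people to produce a corpus of `tokens` tokens."""
    words = tokens * words_per_token
    hours = words / words_per_hour
    return hours * hourly_wage


if __name__ == "__main__":
    # Hypothetical model: 2 trillion training tokens, $5M spent on the training run.
    data_cost = data_labor_cost(tokens=2e12)
    compute_cost = 5e6
    print(f"data labor   ~ ${data_cost:,.0f}")
    print(f"training run ~ ${compute_cost:,.0f}")
    print(f"ratio        ~ {data_cost / compute_cost:,.0f}x")
```

Even with these deliberately low-ball assumptions (valuing the text at typing speed and a $10/hour wage), the illustrative ratio lands near the top of the 10-1000x range the paper reports; rates that reflect actually composing original text would push it higher still.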
Related papers
- Sustainable Carbon-Aware and Water-Efficient LLM Scheduling in Geo-Distributed Cloud Datacenters [2.391483506190989]
Large Language Models (LLMs) such as ChatGPT, CoPilot, and Gemini have been widely adopted in different areas. Recent studies estimate that the costs of operating LLMs in their inference phase can exceed training costs by 25x per year. We propose a novel framework called SLIT to co-optimize LLM quality of service (time-to-first-token), carbon emissions, water usage, and energy costs.
arXiv Detail & Related papers (2025-05-29T15:31:28Z) - LLM360 K2: Building a 65B 360-Open-Source Large Language Model from Scratch [77.02136168850532]
We detail the training of the LLM360 K2-65B model, scaling up our 360-degree OPEN SOURCE approach to the largest and most powerful models under project LLM360.
arXiv Detail & Related papers (2025-01-13T08:26:43Z) - Building a Family of Data Augmentation Models for Low-cost LLM Fine-tuning on the Cloud [12.651588927599441]
We present a family of data augmentation models designed to significantly improve the efficiency for model fine-tuning.
These models, trained from sufficiently small LLMs, support key functionalities with low inference costs.
Experiments and an application study demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2024-12-06T09:04:12Z) - Investigating Cost-Efficiency of LLM-Generated Training Data for Conversational Semantic Frame Analysis [18.44272589315175]
We show how to balance the trade-off between the higher quality but more expensive human data and the lower quality yet substantially cheaper LLM-generated data.
Our experiments, conducted across various budget levels, reveal that optimal cost-efficiency is achieved by combining both human and LLM-generated data.
arXiv Detail & Related papers (2024-10-09T05:15:13Z) - Beyond Next Token Prediction: Patch-Level Training for Large Language Models [69.67438563485887]
We introduce patch-level training for Large Language Models (LLMs). During patch-level training, we feed the language model shorter sequences of patches and train it to predict the next patch. We show that patch-level training can reduce the overall training costs to 0.5×, without compromising the model performance.
arXiv Detail & Related papers (2024-07-17T15:48:39Z) - The Future of Large Language Model Pre-training is Federated [15.237418036900582]
We propose a scalable deployment system called Photon to enable the investigation and development of this new training paradigm for LLM pre-training.
We show that Photon can be used by organizations that wish to collaborate, contributing their private data sources and computational resources, to pre-train LLMs with billions of parameters.
We further show that the effectiveness of federated training scales with model size and present our approach for training billion-scale federated LLMs using limited resources.
arXiv Detail & Related papers (2024-05-17T15:27:52Z) - Continual Learning for Large Language Models: A Survey [95.79977915131145]
Large language models (LLMs) are not amenable to frequent re-training, due to high training costs arising from their massive scale.
This paper surveys recent works on continual learning for LLMs.
arXiv Detail & Related papers (2024-02-02T12:34:09Z) - CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages [86.90220551111096]
Training datasets for large language models (LLMs) are often not fully disclosed.
We present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages.
arXiv Detail & Related papers (2023-09-17T23:49:10Z) - FLM-101B: An Open LLM and How to Train It with $100K Budget [63.244403881531035]
We show that FLM-101B, trained with our growth strategy under a budget of $100K, reaches 80% of the baselines' performances with only 10% of their floating-point operations. We believe that further studies on progressive training will benefit the community by cutting down the costs and promoting green AI.
arXiv Detail & Related papers (2023-09-07T17:07:36Z) - Considerations for health care institutions training large language models on electronic health records [7.048517095805301]
Large language models (LLMs) like ChatGPT have excited scientists across fields.
In medicine, one source of excitement is the potential applications of LLMs trained on electronic health record (EHR) data.
But there are tough questions we must first answer if health care institutions are interested in having LLMs trained on their own data.
arXiv Detail & Related papers (2023-08-24T00:09:01Z) - Aligning Large Language Models with Human: A Survey [53.6014921995006]
Large Language Models (LLMs) trained on extensive textual corpora have emerged as leading solutions for a broad array of Natural Language Processing (NLP) tasks.
Despite their notable performance, these models are prone to certain limitations such as misunderstanding human instructions, generating potentially biased content, or producing factually incorrect information.
This survey presents a comprehensive overview of alignment technologies for LLMs.
arXiv Detail & Related papers (2023-07-24T17:44:58Z) - Artificial Artificial Artificial Intelligence: Crowd Workers Widely Use Large Language Models for Text Production Tasks [12.723777984461693]
Large language models (LLMs) are remarkable data annotators.
Crowdsourcing, an important, inexpensive way to obtain human annotations, may itself be impacted by LLMs.
We estimate that 33-46% of crowd workers used LLMs when completing a task.
arXiv Detail & Related papers (2023-06-13T16:46:24Z)