Theoretical Proof that Generated Text in the Corpus Leads to the Collapse of Auto-regressive Language Models
- URL: http://arxiv.org/abs/2412.14872v2
- Date: Tue, 11 Feb 2025 12:25:11 GMT
- Title: Theoretical Proof that Generated Text in the Corpus Leads to the Collapse of Auto-regressive Language Models
- Authors: Lecheng Wang, Xianjie Shi, Ge Li, Jia Li, Xuanming Zhang, Yihong Dong, Wenpin Jiao, Hong Mei
- Abstract summary: This paper presents theoretical proof that once a corpus (such as the World Wide Web) begins to incorporate generated text, LM collapse is bound to occur.
We express our concerns about the current situation in which an increasing amount of generated text may be used in LM training.
- Score: 26.117724170912552
- License:
- Abstract: Auto-regressive language models (LMs) have been widely used to generate text on the World Wide Web. The generated text is often collected into the training corpus of the next generations of LMs. Previous work experimentally found that LMs collapse when trained on recursively generated text. This paper presents theoretical proof that once a corpus (such as the World Wide Web) begins to incorporate generated text, and the training text of each LM is sampled from this corpus, then no matter how small the amount of text generated by each LM that enters the corpus is, after a sufficient amount of time, LM collapse is bound to occur. Our proof is validated by a series of experiments showing that the collapsed LMs perform no better than an untrained LM with randomly initialized parameters. By proving the existence of LM collapse, we express our concerns about the current situation in which an increasing amount of generated text may be used in LM training. The source code is available in the online data warehouse: https://github.com/wanglc02/generated-data
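The recursive setup described in the abstract can be illustrated with a toy unigram experiment (a minimal sketch, not the paper's proof or released code): each "LM" here is just a unigram distribution fitted to a finite sample generated by the previous one. Because a token that drops out of a finite sample can never be generated again, the support of the distribution can only shrink across generations. For clarity the toy is fully recursive; the paper's result is stronger, proving collapse even when only a small fraction of generated text enters the corpus each round.

```python
import random
from collections import Counter

random.seed(0)

VOCAB = list(range(50))   # toy vocabulary of 50 token types
N = 200                   # finite training sample per generation

def train(corpus):
    """'Train' a unigram LM: maximum-likelihood token probabilities."""
    counts = Counter(corpus)
    return {t: counts[t] / len(corpus) for t in VOCAB}

def generate(model, n):
    """Sample n tokens from the fitted unigram distribution."""
    return random.choices(VOCAB, weights=[model[t] for t in VOCAB], k=n)

corpus = random.choices(VOCAB, k=N)   # generation 0: "human" text, uniform
support = []
for gen in range(100):
    model = train(corpus)
    support.append(sum(1 for t in VOCAB if model[t] > 0))
    corpus = generate(model, N)       # next LM trains only on generated text

# Support is monotonically non-increasing: lost tokens never return.
print(support[0], "->", support[-1])
```

Running the loop shows the number of surviving token types decaying toward a degenerate distribution, the unigram analogue of the collapse the paper proves for auto-regressive LMs.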
Related papers
- Evaluating $n$-Gram Novelty of Language Models Using Rusty-DAWG [57.14250086701313]
We investigate the extent to which modern LMs generate $n$-grams from their training data.
We develop Rusty-DAWG, a novel search tool inspired by indexing of genomic data.
arXiv Detail & Related papers (2024-06-18T21:31:19Z)
- Nearest Neighbor Speculative Decoding for LLM Generation and Attribution [87.3259169631789]
Nearest Neighbor Speculative Decoding (NEST) is capable of incorporating real-world text spans of arbitrary length into LM generations and providing attribution to their sources.
NEST significantly enhances the generation quality and attribution rate of the base LM across a variety of knowledge-intensive tasks.
In addition, NEST substantially improves the generation speed, achieving a 1.8x speedup in inference time when applied to Llama-2-Chat 70B.
arXiv Detail & Related papers (2024-05-29T17:55:03Z)
- Deductive Closure Training of Language Models for Coherence, Accuracy, and Updatability [58.582216812183496]
Language models (LMs) can sometimes generate factually correct text and estimate the truth values of individual claims, but they also generate incorrect or nonsensical content and are difficult to edit and bring up to date.
We present a method called Deductive Closure Training (DCT) that uses LMs themselves to identify implications of (and contradictions within) the text that they generate.
arXiv Detail & Related papers (2024-01-16T18:58:37Z)
- Retrieval-Pretrained Transformer: Long-range Language Modeling with Self-retrieval [51.437420003471615]
We propose the Retrieval-Pretrained Transformer (RPT), an architecture and training procedure for jointly training a retrieval-augmented LM from scratch.
RPT improves retrieval quality and subsequently perplexity across the board compared to strong baselines.
arXiv Detail & Related papers (2023-06-23T10:18:02Z)
- LeTI: Learning to Generate from Textual Interactions [60.425769582343506]
We explore LMs' potential to learn from textual interactions (LETI) that not only check their correctness with binary labels but also pinpoint and explain errors in their outputs through textual feedback.
Our focus is the code generation task, where the model produces code based on natural language instructions.
LETI iteratively fine-tunes the model, using the LM objective, on a concatenation of natural language instructions, LM-generated programs, and textual feedback.
arXiv Detail & Related papers (2023-05-17T15:53:31Z)
- Factuality Enhanced Language Models for Open-Ended Text Generation [60.27166549575472]
We design the FactualityPrompts test set and metrics to measure the factuality of LM generations.
We find that larger LMs are more factual than smaller ones, although a previous study suggests that larger LMs can be less truthful in terms of misconceptions.
We propose a factuality-enhanced training method that uses TopicPrefix for better awareness of facts and sentence completion.
arXiv Detail & Related papers (2022-06-09T17:16:43Z)
- Fingerprinting Fine-tuned Language Models in the Wild [6.7034293304862755]
We study the problem of large-scale fingerprinting of fine-tuned LMs in the wild.
Our results show that fine-tuning itself is the most effective in attributing the synthetic text generated by fine-tuned LMs.
arXiv Detail & Related papers (2021-06-03T09:07:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.