Fugu-MT Paper Translation (Abstract): Language Generation with Replay: A Learning-Theoretic View of Model Collapse

Paper overview: Language Generation with Replay: A Learning-Theoretic View of Model Collapse

arxiv url: http://arxiv.org/abs/2603.11784v1
Date: Thu, 12 Mar 2026 10:44:17 GMT
Status: Translation complete
Last updated in system: 2026-03-13 14:46:26.022959
Title: Language Generation with Replay: A Learning-Theoretic View of Model Collapse
Title (reference translation): Language Generation with Replay: A Learning-Theoretic Perspective on Model Collapse
Authors: Giorgio Racca, Michal Valko, Amartya Sanyal
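Abstract summary: This paper studies the model collapse problem through the theoretical lens of language generation. Its main contribution is a fine-grained learning-theoretic characterization of when replay fundamentally limits generation.
Reference score (independently computed attention score): 27.157142191029024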
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As scaling laws push the training of frontier large language models (LLMs) toward ever-growing data requirements, training pipelines are approaching a regime where much of the publicly available online text may be consumed. At the same time, widespread LLM usage increases the volume of machine-generated content on the web; together, these trends raise the likelihood of generated text re-entering future training corpora, increasing the associated risk of performance degradation often called model collapse. In practice, model developers address this concern through data cleaning, watermarking, synthetic-data policies, or, in some cases, blissful ignorance. However, the problem of model collapse in generative models has not been examined from a learning-theoretic perspective: we study it through the theoretical lens of the language generation in the limit framework, introducing a replay adversary that augments the example stream with the generator's own past outputs. Our main contribution is a fine-grained learning-theoretic characterization of when replay fundamentally limits generation: while replay is benign for the strongest notion of uniform generation, it provably creates separations for the weaker notions of non-uniform generation and generation in the limit. Interestingly, our positive results mirror heuristics widely used in practice, such as data cleaning, watermarking, and output filtering, while our separations show when these ideas can fail.
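Abstract (reference translation): As scaling laws push the training of frontier large language models (LLMs) toward ever-growing data requirements, training pipelines are approaching a regime in which much of the publicly available online text is consumed. At the same time, LLM usage increases the volume of machine-generated content on the web; these trends raise the likelihood of generated text re-entering future training corpora and increase the risk of the performance degradation often called model collapse. In practice, model developers address this problem through data cleaning, watermarking, synthetic-data policies, or, in some cases, ignorance. However, the problem of model collapse in generative models has not been examined from a learning-theoretic perspective. We study it through the theoretical lens of the language generation in the limit framework, introducing a replay adversary that augments the example stream with the generator's past outputs. Our main contribution is a fine-grained learning-theoretic characterization of when replay limits generation: replay is benign for the strongest notion, uniform generation, but provably creates separations for the weaker notions of non-uniform generation and generation in the limit. Interestingly, our positive results mirror heuristics widely used in practice, such as data cleaning, watermarking, and output filtering.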

Related papers