Fugu-MT Paper Translation (Abstract): RePro: Training Language Models to Faithfully Recycle the Web for Pretraining

Paper overview: RePro: Training Language Models to Faithfully Recycle the Web for Pretraining

arxiv url: http://arxiv.org/abs/2510.10681v1
Date: Sun, 12 Oct 2025 16:08:38 GMT
Status: Translation complete
Last updated in system: 2025-10-14 18:06:30.063091
Title: RePro: Training Language Models to Faithfully Recycle the Web for Pretraining
Title (reference translation): RePro: Training Language Models That Faithfully Recycle the Web for Pretraining
Authors: Zichun Yu, Chenyan Xiong
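Abstract summary: High-quality pretraining data is the fossil fuel of large language models (LLMs). RePro is a novel web recycling method that trains a relatively small LM with reinforcement learning to generate effective rephrasings of pretraining data.
Reference score (independently computed attention measure): 28.30636190022749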
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: High-quality pretraining data is the fossil fuel of large language models (LLMs), yet its reserves are running low for frontier models. In this paper, we introduce RePro, a novel web recycling method that trains a relatively small LM with reinforcement learning to generate effective and faithful rephrasings of pretraining data. Specifically, we design one quality reward and three faithfulness rewards, optimizing the LM rephraser to convert organic data into high-quality rephrasings while maintaining its core semantics and structure. In our experiment, we train a 4B rephraser to recycle 72B tokens sampled from DCLM-RefinedWeb. Pretraining results on 400M and 1.4B models demonstrate that RePro delivers 4.7%-14.0% relative accuracy gains over organic-only baseline on 22 downstream tasks. RePro also outperforms ReWire, the state-of-the-art web recycling method that prompts a 70B rephraser, as well as the organic baseline with a 4x larger data pool. Experiments with different amounts of recycled data highlight that RePro improves organic data efficiency by 2-3x. Individual and distributional analyses validate that RePro preserves more critical information and faithfully reflects the characteristics of organic data compared to prompting-based methods. Together, these results show that RePro provides an efficient and controllable path to effectively harness the fossil fuel of LLM pretraining. We open-source our code, rephraser, and recycled data at https://github.com/cxcscmu/RePro.
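Abstract (reference translation): High-quality pretraining data is the fossil fuel of large language models (LLMs), but its reserves are running low for frontier models. This paper introduces RePro, a web recycling method that trains a relatively small LM with reinforcement learning to generate effective and faithful rephrasings of pretraining data. Specifically, one quality reward and three faithfulness rewards are designed to optimize the LM rephraser so that it converts organic data into high-quality rephrasings while preserving the data's core semantics and structure. In the experiments, a 4B rephraser is trained to recycle 72B tokens sampled from DCLM-RefinedWeb. Pretraining results on 400M and 1.4B models show that RePro achieves 4.7%-14.0% relative accuracy gains over the organic-only baseline on 22 downstream tasks. RePro also outperforms ReWire, the state-of-the-art web recycling method that prompts a 70B rephraser, as well as the organic baseline with a 4x larger data pool. Experiments with different amounts of recycled data show that RePro improves organic data efficiency by 2-3x. Individual and distributional analyses confirm that, compared to prompting-based methods, RePro preserves more critical information and faithfully reflects the characteristics of organic data. Together, these results show that RePro provides an efficient and controllable path to effectively harnessing the fossil fuel of LLM pretraining. The code, rephraser, and recycled data are open-sourced at https://github.com/cxcscmu/RePro.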
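
Note on "relative accuracy gain": the figure reported in the abstract is a ratio, not a difference in percentage points. A minimal sketch of that arithmetic, assuming the usual definition (gain = (new − baseline) / baseline). The function name and the accuracy values are hypothetical placeholders, not results from the paper:

```python
def relative_gain(baseline_acc: float, new_acc: float) -> float:
    """Relative accuracy gain of new_acc over baseline_acc, in percent."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (new_acc - baseline_acc) / baseline_acc * 100.0


# Hypothetical accuracies (illustrative only, not figures from the paper).
baseline = 0.40   # organic-only model
recycled = 0.42   # model trained with recycled data
print(f"{relative_gain(baseline, recycled):.1f}% relative gain")
```

Under this definition, a 2-point absolute improvement on a 40% baseline is a 5% relative gain.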

Related papers