Fugu-MT Paper Translation (Abstract): Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future

Paper overview: Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future

arxiv url: http://arxiv.org/abs/2508.06026v1
Date: Fri, 08 Aug 2025 05:25:54 GMT
Status: Translation complete
Last updated in system: 2025-08-11 20:39:06.08675
Title: Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future
Title (reference translation): Temporal Self-Rewarding Language Models: Decoupling Chosen and Rejected Responses via Past and Future
Authors: Yidong Wang, Xin Wang, Cunxiang Wang, Junfeng Fang, Qiufeng Wang, Jianing Chu, Xuran Meng, Shuxun Yang, Libo Qin, Yue Zhang, Wei Ye, Shikun Zhang,
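Abstract summary: Self-Rewarding Language Models propose an architecture in which a large language model (LLM) both generates responses and evaluates its own outputs via LLM-as-a-Judge prompting. This work proposes Temporal Self-Rewarding Language Models, which strategically coordinate past, present, and future model generations to sustain learning signals.
Reference score (attention level computed by this site): 38.1810626252963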
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Self-Rewarding Language Models propose an architecture in which a Large Language Model (LLM) both generates responses and evaluates its own outputs via LLM-as-a-Judge prompting, dynamically improving its generative capabilities through iterative Direct Preference Optimization (DPO). However, our analysis reveals a critical limitation in existing Self-Rewarding paradigms: the synchronized improvement of chosen and rejected responses progressively narrows the representational difference between contrasting samples, undermining effective preference learning. We propose Temporal Self-Rewarding Language Models, which strategically coordinate past, present, and future model generations to sustain learning signals. Our dual-phase framework introduces: (1) Anchored Rejection, which fixes rejected responses using the past initial model's outputs, and (2) Future-Guided Chosen, which dynamically curates chosen samples using next-generation model predictions. Extensive experiments across three model families (Llama, Qwen, Mistral) and different model sizes (Llama3B/8B/70B) demonstrate significant improvements when trained with our method compared to Self-Rewarding using the same computational resources. For example, Llama3.1-8B reaches a 29.44 win rate on AlpacaEval 2.0 with our method, outperforming the Self-Rewarding baseline (19.69) by 9.75 points. Notably, our method also demonstrates superior out-of-distribution generalization across mathematical reasoning (GSM8K), knowledge-based QA (ARC, TruthfulQA), and code generation (HumanEval) tasks, even though we do not specifically collect such training data.
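Abstract (reference translation): Self-Rewarding Language Models propose an architecture that both generates responses and evaluates its own outputs through LLM-as-a-Judge prompting, dynamically improving generative capability via iterative Direct Preference Optimization (DPO). The analysis in this paper, however, exposes a limitation of existing Self-Rewarding paradigms: because chosen and rejected responses improve in lockstep, the representational difference between the contrasting samples gradually narrows, which undermines effective preference learning. The paper proposes Temporal Self-Rewarding Language Models, which strategically coordinate past, present, and future model generations to keep the learning signal alive. The two-phase framework consists of (1) Anchored Rejection, which fixes rejected responses using the outputs of the past initial model, and (2) Future-Guided Chosen, which dynamically curates chosen samples using next-generation model predictions. Extensive experiments across three model families (Llama, Qwen, Mistral) and different model sizes (Llama3B/8B/70B) show substantial improvements over Self-Rewarding under the same computational resources. For example, Llama3.1-8B reaches a 29.44 win rate on AlpacaEval 2.0, exceeding the Self-Rewarding baseline (19.69) by 9.75 points. The method also shows superior out-of-distribution generalization on tasks such as mathematical reasoning (GSM8K), knowledge-based QA (ARC, TruthfulQA), and code generation (HumanEval), even though no training data was specifically collected for them.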
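The two phases described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify how candidates are sampled or ranked, so the selection rules below (lowest-scored past response as rejected, highest-scored response among current and next-generation candidates as chosen) are assumptions, and every name (`build_temporal_pair`, `past_generate`, `current_generate`, `future_generate`, `judge`, `dpo_loss`) is hypothetical. The `dpo_loss` helper shows the standard per-example DPO objective on sequence log-probabilities, included only to make clear what the pairs would feed into.

```python
import math
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interfaces (assumptions, not from the paper):
# a generator maps a prompt to candidate responses,
# a judge maps (prompt, response) to a scalar LLM-as-a-Judge score.
Generator = Callable[[str], List[str]]
Judge = Callable[[str, str], float]


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def build_temporal_pair(
    prompt: str,
    past_generate: Generator,
    current_generate: Generator,
    future_generate: Generator,
    judge: Judge,
) -> PreferencePair:
    """Assemble one preference pair with decoupled chosen/rejected sources."""
    # Anchored Rejection: the rejected response comes from the frozen
    # initial ("past") model. Picking its lowest-scored output is an assumption.
    past_candidates = past_generate(prompt)
    rejected = min(past_candidates, key=lambda r: judge(prompt, r))

    # Future-Guided Chosen: candidates from the current model and from a
    # next-generation ("future") model compete; the best-scored one is kept.
    # Pooling both sets is an assumption about how curation might work.
    candidates = current_generate(prompt) + future_generate(prompt)
    chosen = max(candidates, key=lambda r: judge(prompt, r))

    return PreferencePair(prompt=prompt, chosen=chosen, rejected=rejected)


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Standard per-example DPO loss: -log(sigmoid(margin))."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # Numerically stable softplus(-margin), equal to -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

In this sketch the rejected side never moves as training iterates, while the chosen side can keep improving, which is one way to read the abstract's goal of keeping chosen and rejected responses from converging.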

Related papers