Fugu-MT Paper Translation (Abstract): GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training

Paper overview: GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training

arxiv url: http://arxiv.org/abs/2512.13043v1
Date: Mon, 15 Dec 2025 07:11:56 GMT
Status: Translation complete
Last updated in system: 2025-12-16 17:54:56.565367
Title: GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training
Title (reference translation): GTR-Turbo: Merged Checkpoint is a Free Teacher for Agentic VLM Training
Authors: Tong Wei, Yijun Yang, Changhao Zhang, Junliang Xing, Yuanchun Shi, Zongqing Lu, Deheng Ye
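Abstract summary: Multi-turn reinforcement learning (RL) for multi-modal agents built upon vision-language models (VLMs) is hampered by sparse rewards and long-horizon credit assignment. Recent methods densify the reward by querying a teacher that provides step-level feedback, e.g., Guided Thought Reinforcement (GTR) and On-Policy Distillation. We introduce GTR-Turbo, a highly efficient upgrade to GTR.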
Reference score (independently computed attention metric): 70.77088051192334
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multi-turn reinforcement learning (RL) for multi-modal agents built upon vision-language models (VLMs) is hampered by sparse rewards and long-horizon credit assignment. Recent methods densify the reward by querying a teacher that provides step-level feedback, e.g., Guided Thought Reinforcement (GTR) and On-Policy Distillation, but rely on costly, often privileged models as the teacher, limiting practicality and reproducibility. We introduce GTR-Turbo, a highly efficient upgrade to GTR, which matches the performance without training or querying an expensive teacher model. Specifically, GTR-Turbo merges the weights of checkpoints produced during the ongoing RL training, and then uses this merged model as a "free" teacher to guide the subsequent RL via supervised fine-tuning or soft logit distillation. This design removes dependence on privileged VLMs (e.g., GPT or Gemini), mitigates the "entropy collapse" observed in prior work, and keeps training stable. Across diverse visual agentic tasks, GTR-Turbo improves the accuracy of the baseline model by 10-30% while reducing wall-clock training time by 50% and compute cost by 60% relative to GTR.
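Abstract (reference translation): Multi-turn reinforcement learning (RL) for multi-modal agents built upon vision-language models (VLMs) is hampered by sparse rewards and long-horizon credit assignment. Recent methods densify the reward by querying a teacher that provides step-level feedback, e.g., Guided Thought Reinforcement (GTR) and On-Policy Distillation, but rely on costly privileged models, limiting practicality and reproducibility. We introduce GTR-Turbo, a highly efficient upgrade to GTR. Specifically, GTR-Turbo merges the weights of checkpoints produced during the ongoing RL training and uses this merged model as a "free" teacher to guide RL via supervised fine-tuning or soft logit distillation. This design removes dependence on privileged VLMs (e.g., GPT or Gemini), mitigates the "entropy collapse" observed in prior work, and keeps training stable. Across diverse visual agentic tasks, GTR-Turbo improves the accuracy of the baseline model by 10-30% while reducing wall-clock training time by 50% and compute cost by 60% relative to GTR.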
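
The merge-then-distill idea described in the abstract can be illustrated with a small sketch. The abstract does not specify how checkpoints are merged or which distillation loss is used, so everything concrete here is an assumption: uniform parameter averaging as the merge rule, KL(teacher || student) as the soft logit distillation signal, and the hypothetical names `merge_checkpoints`, `log_softmax`, and `distillation_loss`. Checkpoints are modeled as plain dictionaries of float lists rather than real model weights.

```python
import math


def merge_checkpoints(checkpoints):
    """Average each parameter across checkpoints.

    Assumption: uniform averaging. The abstract only says the weights of
    checkpoints are merged, not which merge rule is used.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    merged = {}
    for name in checkpoints[0]:
        # Group the i-th element of this parameter from every checkpoint.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        merged[name] = [sum(vals) / len(checkpoints) for vals in columns]
    return merged


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over one token's logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_total for x in scaled]


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) for a single token position.

    Assumption: this KL direction is illustrative; the abstract does not
    say which divergence or direction is used.
    """
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Toy usage: two "checkpoints" with a single two-element parameter.
checkpoints = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
teacher = merge_checkpoints(checkpoints)  # {"w": [2.0, 3.0]}
```

In an actual training loop the merged model would produce the teacher logits for the states the current policy visits, and the resulting loss would be combined with the RL objective; how the two are weighted is not stated in the abstract.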

