Fugu-MT Paper Translation (Abstract): LLMZero: Discovering Adaptive Training Strategies for RL Post-Training via LLM Agents

Paper summary: LLMZero: Discovering Adaptive Training Strategies for RL Post-Training via LLM Agents

arxiv url: http://arxiv.org/abs/2606.18388v1
Date: Tue, 16 Jun 2026 18:33:08 GMT
Status: translation complete
Last updated in system: 2026-06-18 17:16:50.847934
Title: LLMZero: Discovering Adaptive Training Strategies for RL Post-Training via LLM Agents
Authors: Haoyang Fang, Wei Zhu, Boran Han, Alex Zhang, Zhenyu Pan, Shuo Yang, Shuai Zhang, Jiading Gai, Peng Tang, Cuixiong Hu, Xuan Zhu, Huzefa Rangwala, George Karypis, Bernie Wang
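Abstract summary: RL post-training strategies are dataset-dependent and reveal a recurring empirical pattern. Regularization parameters oscillate in response to shifting training dynamics. We use a system in which LLM agents search over training trajectories via tree search.
Reference score (attention metric computed by the site): 51.74109282213905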
License: http://creativecommons.org/licenses/by/4.0/
Abstract: RL post-training strategies are dataset-dependent and reveal a recurring empirical pattern: capacity parameters accumulate monotonically across stages, while regularization parameters predominantly oscillate in response to shifting training dynamics. This distinction matters because fixed schedules commit all parameters to fixed trajectories and therefore cannot express the non-stationary exploration-exploitation tradeoffs that regularization must track; the principle provides actionable design rules for multi-stage training. We discover this through LLMZero, a system where LLM agents search over training trajectories via tree search, diagnosing pathologies at each checkpoint and proposing coordinated multi-parameter transitions. Across 4 diverse GRPO tasks, LLMZero discovers strategies that improve over the base model by 9% to 140% relative and over grid search by 6% to 15% relative, consistently outperforming random search and the skill-based agent. The structural principle transfers across tasks, providing an explanation for why discovered strategies take qualitatively different forms yet share similar parameter dynamics.
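The two quantitative ideas in the abstract, relative improvement and the monotonic-versus-oscillating distinction between parameter trajectories, can be illustrated with a minimal Python sketch. This is not code from the paper: the function names, the per-stage values, and the rule used to label a trajectory are all assumptions made for illustration only.

```python
from typing import Sequence


def relative_improvement(baseline: float, score: float) -> float:
    """Relative gain of `score` over `baseline`, as a fraction (0.09 means 9%)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (score - baseline) / baseline


def classify_trajectory(values: Sequence[float]) -> str:
    """Label a per-stage parameter trajectory by its direction changes.

    Assumed rule (not from the paper): stages where the value is unchanged
    are ignored; if all remaining steps share one sign the trajectory is
    "monotonic", otherwise it is "oscillating".
    """
    steps = [b - a for a, b in zip(values, values[1:]) if b != a]
    if not steps:
        return "constant"
    if all(s > 0 for s in steps) or all(s < 0 for s in steps):
        return "monotonic"
    return "oscillating"


# Hypothetical per-stage values, invented for illustration only.
capacity_param = [512, 1024, 1024, 2048]
regularization_param = [0.04, 0.01, 0.05, 0.02]

print(classify_trajectory(capacity_param))
print(classify_trajectory(regularization_param))
print(round(relative_improvement(0.50, 0.545), 2))
```

Under this reading, a fixed schedule corresponds to choosing every trajectory in advance, whereas the abstract argues that regularization parameters need to change direction as training dynamics shift.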
