Fugu-MT Paper Translation (Summary): SAT: Sequential Agent Tuning for Coordinator Free Plug and Play Multi-LLM Training with Monotonic Improvement Guarantees

Paper summary: SAT: Sequential Agent Tuning for Coordinator Free Plug and Play Multi-LLM Training with Monotonic Improvement Guarantees

arxiv url: http://arxiv.org/abs/2605.05216v1
Date: Fri, 17 Apr 2026 01:45:30 GMT
Status: Translation complete
Last updated in system: 2026-05-11 06:56:26.601796
Title: SAT: Sequential Agent Tuning for Coordinator Free Plug and Play Multi-LLM Training with Monotonic Improvement Guarantees
Authors: Yi Xie, Yangyang Xu, Yi Fan, Bo Liu
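Abstract summary: Large language models (LLMs) with a large number of parameters achieve strong performance but are often prohibitively expensive to deploy. Recent work explores teams of smaller, more efficient LLMs that match or even outperform a single large model. The authors address the instability of jointly training such teams by introducing Sequential Agent Tuning (SAT), a coordinator-free training paradigm.
Reference score (independently computed attention score): 20.52379192411959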
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) with a large number of parameters achieve strong performance but are often prohibitively expensive to deploy. Recent work explores using teams of smaller, more efficient LLMs that collectively match or even outperform a single large model. However, jointly updating multiple agents introduces compounding distribution shifts, making coordination and stability during training difficult. We address this by introducing Sequential Agent Tuning (SAT), a coordinator-free training paradigm. SAT represents the team as a factorized policy and employs block-coordinate updates over agents, enabling scalable, decentralized training without a central controller. Specifically, we develop a sequence-aware, on-policy advantage estimator that conditions on the evolving team policy, coupled with per-agent KL trust regions that isolate occupancy drift. Theoretically, this framework provides two critical guarantees. First, it ensures monotonic improvement, stabilizing the training process. Second, it establishes provable plug-and-play invariance: any agent can be upgraded to a stronger model without retraining the rest of the team, with a formal guarantee that the performance bound improves. Empirically, a team of three 4B agents (12B total) trained with SAT surpasses the much larger Qwen3-32B on AIME24/25 benchmarks by 3.9% on average. We validate our plug-and-play theory by swapping in two 8B agents, which boosts the composite score by 10.4%. We provide code and an appendix of proofs at https://github.com/Yydc/SAT-AAMAS
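The block-coordinate idea in the abstract can be illustrated with a toy sketch. The Python below is not the authors' SAT implementation: it assumes a one-step team game with exactly computable expectations, small categorical policies standing in for LLM agents, and a backtracking step that accepts an update only when the per-agent KL divergence stays under a hypothetical `kl_limit` and the team value does not drop. All names (`team_value`, `agent_advantage`, `sequential_update`, `toy_reward`) are illustrative assumptions.

```python
import itertools
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl(p, q):
    """KL(p || q) for two categorical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def team_value(policies, reward):
    """Exact expected reward under the factorized team policy prod_i pi_i(a_i)."""
    total = 0.0
    for joint in itertools.product(*(range(len(p)) for p in policies)):
        prob = math.prod(p[a] for p, a in zip(policies, joint))
        total += prob * reward(joint)
    return total


def agent_advantage(policies, reward, i):
    """Advantage of each action of agent i, other agents held at their current policies."""
    base = team_value(policies, reward)
    adv = []
    for a in range(len(policies[i])):
        one_hot = [1.0 if b == a else 0.0 for b in range(len(policies[i]))]
        trial = policies[:i] + [one_hot] + policies[i + 1:]
        adv.append(team_value(trial, reward) - base)
    return adv


def sequential_update(logits, reward, kl_limit=0.05, step=1.0, rounds=20):
    """Update one agent at a time (block-coordinate), each inside a KL trust region."""
    history = []
    for _ in range(rounds):
        for i in range(len(logits)):
            # Recompute the team policy so agent i sees earlier agents' updates.
            policies = [softmax(l) for l in logits]
            old_pi = policies[i]
            old_value = team_value(policies, reward)
            adv = agent_advantage(policies, reward, i)
            eta = step
            while eta > 1e-6:  # backtrack until the step fits the trust region
                cand = [l + eta * a for l, a in zip(logits[i], adv)]
                new_pi = softmax(cand)
                trial = policies[:i] + [new_pi] + policies[i + 1:]
                if kl(old_pi, new_pi) <= kl_limit and team_value(trial, reward) >= old_value:
                    logits[i] = cand
                    break
                eta *= 0.5
        history.append(team_value([softmax(l) for l in logits], reward))
    return history


def toy_reward(joint):
    # Hypothetical team reward: higher actions help, with a bonus if all agents pick action 2.
    bonus = 1.0 if all(a == 2 for a in joint) else 0.0
    return sum(joint) / 6.0 + bonus


logits = [[0.0, 0.0, 0.0] for _ in range(3)]  # three agents, three actions each
initial = team_value([softmax(l) for l in logits], toy_reward)
history = sequential_update(logits, toy_reward)
print(initial, history[-1])
```

Because each accepted step is checked against the current team value, the recorded values are non-decreasing by construction in this toy setting. The paper's guarantee concerns sampled, sequence-level training and rests on its own proofs.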

Related papers