Fugu-MT Paper Translation (Abstract): Libra: Efficient Resource Management for Agentic RL Post-Training

Paper abstract: Libra: Efficient Resource Management for Agentic RL Post-Training

arxiv url: http://arxiv.org/abs/2606.03077v2
Date: Wed, 10 Jun 2026 06:28:18 GMT
Status: Translation complete
Last updated in system: 2026-06-11 14:23:44.307014
Title: Libra: Efficient Resource Management for Agentic RL Post-Training
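Title (reference translation): Libra: Efficient Resource Management for Agentic RL Post-Training
Authors: Kaiwen Chen, Xin Tan, Jingzong Li, Hong Xu
Abstract summary: Reinforcement learning (RL) has emerged as a standard post-training paradigm for shaping large language models (LLMs) into capable agents. In agentic RL, the rollout stage generates trajectories while invoking tools, producing long-tailed and non-stationary workloads. We present Libra, a resource management system that addresses the resulting long-tail and cross-stage imbalance challenges via two core mechanisms.
Reference score (independently computed attention score): 11.701871372256205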
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement learning (RL) has emerged as a standard post-training paradigm for shaping large language models (LLMs) into capable agents. In agentic RL, the rollout stage generates trajectories while invoking tools, producing long-tailed and non-stationary workloads that expose two fundamental challenges in resource management. First, due to the long-tail distribution, a small fraction of trajectories dominates rollout makespan. Second, rollout and training are subject to cross-stage imbalance, as they exhibit strong asymmetry in compute patterns, memory demands, and sensitivity to sequence length. Compounding this asymmetry, the sequence length distribution drifts continuously as the policy evolves, rendering any static resource split progressively suboptimal. We present Libra, a resource management system to address both challenges via two core mechanisms. The first is a global resource planner that jointly optimizes GPU allocation across rollout and training clusters. It leverages an elastic hybrid pool to enable lightweight, non-blocking worker reallocation between stages. The second is a causality-driven multi-level feedback queue (C-MLFQ) scheduler, which routes requests to heterogeneous rollout buckets based on causal signals derived from tool-return outcomes, rather than relying on fragile length predictions. Evaluated on 48 A800 GPUs, Libra achieves up to 3.0x higher throughput and converges up to 2.5x faster in reward compared to the baselines.
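Abstract (reference translation): Reinforcement learning (RL) has emerged as a standard post-training paradigm for turning large language models (LLMs) into capable agents. In agentic RL, the rollout stage generates trajectories while invoking tools, producing long-tailed and non-stationary workloads that expose two fundamental challenges in resource management. First, because of the long-tail distribution, a small fraction of trajectories dominates the rollout makespan. Second, rollout and training suffer from cross-stage imbalance, since they differ strongly in compute patterns, memory demands, and sensitivity to sequence length. Compounding this asymmetry, the sequence length distribution drifts continuously as the policy evolves, so any static resource split becomes progressively suboptimal. We present Libra, a resource management system that addresses both challenges via two core mechanisms. The first is a global resource planner that jointly optimizes GPU allocation across the rollout and training clusters. It leverages an elastic hybrid pool to enable lightweight, non-blocking worker reallocation between stages. The second is a causality-driven multi-level feedback queue (C-MLFQ) scheduler, which routes requests to heterogeneous rollout buckets based on causal signals derived from tool-return outcomes rather than relying on fragile length predictions. Evaluated on 48 A800 GPUs, Libra achieves up to 3.0x higher throughput and converges up to 2.5x faster in reward than the baselines.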
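
The abstract names a multi-level feedback queue that reacts to tool-return outcomes but does not describe its signals or routing policy. As a rough illustration of the general idea only, the following Python sketch shows a generic multi-level feedback queue in which a request moves to a lower-priority bucket each time a tool return indicates that the trajectory will continue. The names `Request`, `FeedbackQueue`, `on_tool_return`, and the `needs_more_turns` flag are hypothetical and are not taken from the paper; Libra's actual C-MLFQ may differ substantially.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    level: int = 0  # 0 is the highest-priority bucket (assumed convention)


class FeedbackQueue:
    """Toy multi-level feedback queue; an illustration, not Libra's scheduler."""

    def __init__(self, num_levels: int = 3):
        self.levels = [deque() for _ in range(num_levels)]

    def submit(self, req: Request) -> None:
        # Place the request in the bucket matching its current level.
        self.levels[req.level].append(req)

    def on_tool_return(self, req: Request, needs_more_turns: bool) -> None:
        # Hypothetical signal: the tool result shows the trajectory continues,
        # so demote the request one level (capped at the lowest bucket).
        if needs_more_turns:
            req.level = min(req.level + 1, len(self.levels) - 1)
            self.submit(req)
        # Otherwise the trajectory is finished and is not re-enqueued.

    def next_request(self):
        # Serve the highest-priority non-empty bucket first.
        for bucket in self.levels:
            if bucket:
                return bucket.popleft()
        return None
```

In this sketch a request's bucket depends only on outcomes already observed, not on a predicted output length, which is the contrast the abstract draws with length-prediction approaches.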

