Fugu-MT Paper Translation (Overview): AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs

Paper overview: AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs

arxiv url: http://arxiv.org/abs/2605.15565v1
Date: Fri, 15 May 2026 03:13:35 GMT
Status: Translation complete
Last updated in system: 2026-05-18 21:22:26.157498
Title: AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs
Authors: Haizhong Zheng, Yizhuo Di, Jiahui Wang, Shuowei Jin, Xueshen Liu, Yongji Wu, Z. Morley Mao, Ion Stoica, Jiawei Zhao, Beidi Chen,
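Abstract summary: Reinforcement learning (RL) is increasingly used to improve the reasoning, coding, and tool-use capabilities of large language models, but agentic RL remains prohibitively expensive. This paper proposes AstraFlow, a dataflow-oriented RL system that replaces conventional trainer-centered control with principled component abstractions.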
Reference score (independently computed attention score): 57.86040075371121
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Reinforcement learning (RL) is increasingly used to improve the reasoning, coding, and tool-use capabilities of large language models, but agentic RL remains prohibitively expensive. Scaling RL to agentic LLMs requires supporting complex workloads, including multi-policy collaborative training, while efficiently using elastic, heterogeneous, and cross-region compute resources. Existing LLM RL systems support some of these capabilities, but each new extension often requires dedicated system engineering. This burden arises from trainer-centered control architectures and the lack of principled abstractions for RL system components. To address these limitations, we propose AstraFlow, a dataflow-oriented RL system that replaces conventional trainer-centered control with principled component abstractions. In AstraFlow, rollout services, dataflow management, and training are decoupled into autonomous components, enabling the system to natively support complex multi-policy agentic RL workloads and efficiently exploit diverse compute resources. We evaluate AstraFlow across math, code, search, and AgentBench workloads, showing that the same system supports multi-policy training, elastic scaling, heterogeneous cross-region execution, and composable data algorithms without system-level code changes. In multi-policy collaborative training, AstraFlow achieves comparable or better accuracy than existing RL systems while speeding up training time by 2.7x.
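The abstract says AstraFlow decouples rollout services, dataflow management, and training into autonomous components. The following is a minimal plain-Python sketch of that idea, assuming a simple thread-and-queue model. The names `Trajectory`, `DataflowManager`, `rollout_service`, `trainer`, and `run`, the per-policy queues, and the placeholder reward are hypothetical illustrations, not AstraFlow's actual API. Rollout and training threads never call each other. They exchange `Trajectory` records only through the dataflow component, and each policy gets its own channel as a rough stand-in for multi-policy training.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass
class Trajectory:
    policy: str      # which (hypothetical) policy produced this sample
    prompt_id: int
    reward: float


class DataflowManager:
    """Hypothetical dataflow component: routes trajectories to per-policy queues."""

    def __init__(self, policies):
        self._queues = {p: queue.Queue() for p in policies}

    def put(self, traj):
        self._queues[traj.policy].put(traj)

    def get_batch(self, policy, batch_size):
        # Blocks until batch_size trajectories for this policy are available.
        return [self._queues[policy].get() for _ in range(batch_size)]


def rollout_service(policy, num_prompts, dataflow):
    """Hypothetical rollout worker: produces samples without being driven by the trainer."""
    for i in range(num_prompts):
        reward = float(i)  # placeholder; a real service would run the LLM and a reward model
        dataflow.put(Trajectory(policy, i, reward))


def trainer(policy, num_steps, batch_size, dataflow, log, lock):
    """Hypothetical trainer: pulls batches from the dataflow and never calls rollout."""
    for step in range(num_steps):
        batch = dataflow.get_batch(policy, batch_size)
        mean_reward = sum(t.reward for t in batch) / len(batch)
        # A real trainer would compute and apply a policy-gradient update here.
        with lock:
            log.append((policy, step, mean_reward))


def run(policies=("solver", "verifier"), num_prompts=8, batch_size=4):
    """Starts one rollout thread and one trainer thread per policy, then waits for all."""
    dataflow = DataflowManager(policies)
    log, lock = [], threading.Lock()
    num_steps = num_prompts // batch_size
    threads = []
    for p in policies:
        threads.append(threading.Thread(target=rollout_service, args=(p, num_prompts, dataflow)))
        threads.append(threading.Thread(
            target=trainer, args=(p, num_steps, batch_size, dataflow, log, lock)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(log)


if __name__ == "__main__":
    for entry in run():
        print(entry)
```

With the defaults, `run()` returns `[('solver', 0, 1.5), ('solver', 1, 5.5), ('verifier', 0, 1.5), ('verifier', 1, 5.5)]`, because each policy's FIFO queue delivers prompts 0 to 3 and then 4 to 7. Adding a third policy only changes the `policies` argument, not the rollout or trainer code. This loosely mirrors the abstract's claim that new workloads need no system-level code changes. The sketch does not model the paper's elastic, heterogeneous, or cross-region execution.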
