Fugu-MT Paper Translation (Abstract): MAESTRO: Multi-Agent Environment Shaping through Task and Reward Optimization

Paper abstract: MAESTRO: Multi-Agent Environment Shaping through Task and Reward Optimization

arxiv url: http://arxiv.org/abs/2511.19253v1
Date: Mon, 24 Nov 2025 16:05:37 GMT
Status: Translation complete
System update date: 2025-11-25 18:34:25.294011
Title: MAESTRO: Multi-Agent Environment Shaping through Task and Reward Optimization
Title (reference translation): MAESTRO: Multi-agent environment shaping through task and reward optimization
Authors: Boyuan Wu,
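Abstract summary: Existing approaches rely on fixed heuristics or use Large Language Models (LLMs) directly in the control loop. We propose MAESTRO, a framework that moves the LLM outside the execution loop and uses it as an offline training architect. We evaluate MAESTRO on large-scale traffic signal control (Hangzhou, 16 intersections) and conduct controlled ablations.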
Reference score (independently computed attention score): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Cooperative Multi-Agent Reinforcement Learning (MARL) faces two major design bottlenecks: crafting dense reward functions and constructing curricula that avoid local optima in high-dimensional, non-stationary environments. Existing approaches rely on fixed heuristics or use Large Language Models (LLMs) directly in the control loop, which is costly and unsuitable for real-time systems. We propose MAESTRO (Multi-Agent Environment Shaping through Task and Reward Optimization), a framework that moves the LLM outside the execution loop and uses it as an offline training architect. MAESTRO introduces two generative components: (i) a semantic curriculum generator that creates diverse, performance-driven traffic scenarios, and (ii) an automated reward synthesizer that produces executable Python reward functions adapted to evolving curriculum difficulty. These components guide a standard MARL backbone (MADDPG) without increasing inference cost at deployment. We evaluate MAESTRO on large-scale traffic signal control (Hangzhou, 16 intersections) and conduct controlled ablations. Results show that combining LLM-generated curricula with LLM-generated reward shaping yields improved performance and stability. Across four seeds, the full system achieves +4.0% higher mean return (163.26 vs. 156.93) and 2.2% better risk-adjusted performance (Sharpe 1.53 vs. 0.70) over a strong curriculum baseline. These findings highlight LLMs as effective high-level designers for cooperative MARL training.
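The headline numbers in the abstract can be checked with simple arithmetic. The sketch below is only an illustration: the helper name `relative_gain_pct` and the variable names are assumptions introduced here, not anything from the paper. It reproduces the +4.0% mean-return gain from the reported values, and it suggests that the Sharpe figures (1.53 vs. 0.70) correspond to a ratio of roughly 2.2 rather than a 2.2% difference.

```python
def relative_gain_pct(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Values as reported in the abstract (full system vs. curriculum baseline).
mean_return_gain = relative_gain_pct(163.26, 156.93)   # about 4.0 percent
sharpe_multiple = 1.53 / 0.70                           # about 2.2 times
sharpe_gain = relative_gain_pct(1.53, 0.70)             # well above 100 percent

print(f"Mean return gain: {mean_return_gain:.1f}%")
print(f"Sharpe multiple:  {sharpe_multiple:.1f}x")
print(f"Sharpe gain:      {sharpe_gain:.1f}%")
```

Read this way, the "2.2" in the abstract matches the Sharpe ratio multiple; whether the authors intended a multiple or a percentage is not stated in the text above.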
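Abstract (reference translation): Cooperative Multi-Agent Reinforcement Learning (MARL) faces two major design bottlenecks: crafting dense reward functions, and constructing curricula that avoid local optima in high-dimensional, non-stationary environments. Existing approaches rely on fixed heuristics or use Large Language Models (LLMs) directly in the control loop. This paper proposes MAESTRO (Multi-Agent Environment Shaping through Task and Reward Optimization). MAESTRO introduces two generative components: (i) a semantic curriculum generator that creates diverse, performance-driven traffic scenarios, and (ii) an automated reward synthesizer that produces executable Python reward functions adapted to curriculum difficulty. These components guide a standard MARL backbone (MADDPG) without increasing inference cost at deployment. MAESTRO is evaluated on large-scale traffic signal control (Hangzhou, 16 intersections) with controlled ablations. The results show that combining LLM-generated curricula with LLM-generated reward shaping improves performance and stability. Across four seeds, the full system achieves +4.0% higher mean return (163.26 vs. 156.93) and 2.2% better risk-adjusted performance (Sharpe 1.53 vs. 0.70). These findings highlight LLMs as effective high-level designers for cooperative MARL training.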

