Related papers: MAESTRO: Multi-Agent Environment Shaping through Task and Reward Optimization

MAESTRO: Multi-Agent Environment Shaping through Task and Reward Optimization

URL: http://arxiv.org/abs/2511.19253v1
Date: Mon, 24 Nov 2025 16:05:37 GMT
Title: MAESTRO: Multi-Agent Environment Shaping through Task and Reward Optimization
Authors: Boyuan Wu,
Abstract summary: Existing approaches rely on fixed-generated Large Language Models (LLMs) directly in the control loop.<n>We propose MAESTRO, a framework that moves the LLM outside the execution loop and uses it as an offline training architect.<n>We evaluate MAESTRO on large-scale traffic signal control (Hangzhou, 16 intersections) and conduct controlled ablations.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Cooperative Multi-Agent Reinforcement Learning (MARL) faces two major design bottlenecks: crafting dense reward functions and constructing curricula that avoid local optima in high-dimensional, non-stationary environments. Existing approaches rely on fixed heuristics or use Large Language Models (LLMs) directly in the control loop, which is costly and unsuitable for real-time systems. We propose MAESTRO (Multi-Agent Environment Shaping through Task and Reward Optimization), a framework that moves the LLM outside the execution loop and uses it as an offline training architect. MAESTRO introduces two generative components: (i) a semantic curriculum generator that creates diverse, performance-driven traffic scenarios, and (ii) an automated reward synthesizer that produces executable Python reward functions adapted to evolving curriculum difficulty. These components guide a standard MARL backbone (MADDPG) without increasing inference cost at deployment. We evaluate MAESTRO on large-scale traffic signal control (Hangzhou, 16 intersections) and conduct controlled ablations. Results show that combining LLM-generated curricula with LLM-generated reward shaping yields improved performance and stability. Across four seeds, the full system achieves +4.0% higher mean return (163.26 vs. 156.93) and 2.2% better risk-adjusted performance (Sharpe 1.53 vs. 0.70) over a strong curriculum baseline. These findings highlight LLMs as effective high-level designers for cooperative MARL training.

Related papers

AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering [52.67783579040657]
AceGRPO is a machine learning system that prioritizes tasks at the agent's learning frontier to maximize learning efficiency.<n>Our trained Ace-30B model achieves a 100% valid submission rate on MLE-Bench-Lite, approaches the performance of proprietary frontier models, and outperforms larger open-source baselines.
arXiv Detail & Related papers (2026-02-08T10:55:03Z)
Beyond Single LLMs: Enhanced Code Generation via Multi-Stage Performance-Guided LLM Orchestration [12.674888937998086]
Large Language Models (LLMs) have become the predominant paradigm for automated code generation.<n>This paper challenges the single-model convention by introducing a multi-stage, performance-guided orchestration framework.<n>Perch orchestrates top-performing LLMs for each task context through stage-wise validation and rollback mechanisms.
arXiv Detail & Related papers (2025-10-01T19:07:16Z)
LLM4CMO: Large Language Model-aided Algorithm Design for Constrained Multiobjective Optimization [54.35609820607923]
Large language models (LLMs) offer new opportunities for assisting with algorithm design.<n>We propose LLM4CMO, a novel CMOEA based on a dual-population, two-stage framework.<n>LLMs can serve as efficient co-designers in the development of complex evolutionary optimization algorithms.
arXiv Detail & Related papers (2025-08-16T02:00:57Z)
Omni-Thinker: Scaling Multi-Task RL in LLMs with Hybrid Reward and Task Scheduling [66.0871543682453]
We present Omni-Thinker, a unified reinforcement learning framework that scales large language models across diverse tasks.<n>Our scheduler orders tasks according to accuracy backward transfer (BWT), reducing forgetting and improving multi-task performance.
arXiv Detail & Related papers (2025-07-20T01:50:16Z)
LAMARL: LLM-Aided Multi-Agent Reinforcement Learning for Cooperative Policy Generation [12.098817831819078]
Large Language Models (LLMs) have shown promise in single-robot settings, but their application in multi-robot systems remains largely unexplored.<n>This paper introduces a novel LLM-Aided MARL (LAMARL) approach, which integrates MARL with LLMs, significantly enhancing sample efficiency without requiring manual design.
arXiv Detail & Related papers (2025-06-02T10:59:54Z)
Training LLM-Based Agents with Synthetic Self-Reflected Trajectories and Partial Masking [61.61356842567952]
We propose STeP, a novel method for improving LLM-based agent training.<n>We synthesize self-reflected trajectories that include reflections and corrections of error steps.<n>Experiments demonstrate that our method improves agent performance across three representative tasks.
arXiv Detail & Related papers (2025-05-26T14:11:12Z)
Direct Retrieval-augmented Optimization: Synergizing Knowledge Selection and Language Models [83.8639566087953]
We propose a direct retrieval-augmented optimization framework, named DRO, that enables end-to-end training of two key components.<n>DRO alternates between two phases: (i) document permutation estimation and (ii) re-weighted, progressively improving RAG components.<n>Our theoretical analysis reveals that DRO is analogous to policy-gradient methods in reinforcement learning.
arXiv Detail & Related papers (2025-05-05T23:54:53Z)
LERO: LLM-driven Evolutionary framework with Hybrid Rewards and Enhanced Observation for Multi-Agent Reinforcement Learning [4.343021413805699]
Multi-agent reinforcement learning (MARL) faces two critical bottlenecks distinct from single-agent RL.<n>We propose LERO, a framework integrating Large language models (LLMs) with evolutionary optimization to address these MARL-specific challenges.
arXiv Detail & Related papers (2025-03-25T06:28:42Z)
Dynamic Optimizations of LLM Ensembles with Two-Stage Reinforcement Learning Agents [31.341487297459995]
This paper introduces RL-Focal, a two-stage RL agent framework that routes and ensembles LLMs.<n>By focal diversity, we enhance performance across tasks by effectively promoting reward-aware and policy-adaptive ensemble selection and inference fusion.
arXiv Detail & Related papers (2025-02-06T20:44:26Z)
Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration [68.29746557968107]
We propose a novel framework for multi-agent collaboration that introduces Reinforced Advantage feedback (ReAd) for efficient self-refinement of plans.<n> Experiments on Over-AI and a difficult variant of RoCoBench show that ReAd surpasses baselines in success rate, and also significantly decreases the interaction steps of agents.
arXiv Detail & Related papers (2024-05-23T08:33:19Z)
Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models [69.51130760097818]
We propose Zooter, a reward-guided routing method distilling rewards on training queries to train a routing function. We evaluate Zooter on a comprehensive benchmark collection with 26 subsets on different domains and tasks.
arXiv Detail & Related papers (2023-11-15T04:40:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.