Related papers: ReMA: Learning to Meta-think for LLMs with Multi-Agent Reinforcement Learning

ReMA: Learning to Meta-think for LLMs with Multi-Agent Reinforcement Learning

URL: http://arxiv.org/abs/2503.09501v2
Date: Fri, 14 Mar 2025 05:33:47 GMT
Title: ReMA: Learning to Meta-think for LLMs with Multi-Agent Reinforcement Learning
Authors: Ziyu Wan, Yunxiang Li, Yan Song, Hanjing Wang, Linyi Yang, Mark Schmidt, Jun Wang, Weinan Zhang, Shuyue Hu, Ying Wen,
Abstract summary: We introduce Reinforced Meta-thinking Agents (ReMA), a novel framework that leverages Multi-Agent Reinforcement Learning (MARL) to elicit meta-thinking behaviors.<n>ReMA decouples the reasoning process into two hierarchical agents: a high-level meta-thinking agent responsible for generating strategic oversight and plans, and a low-level reasoning agent for detailed executions.<n> Experimental results demonstrate that ReMA outperforms single-agent RL baselines on complex reasoning tasks.
Score: 54.787341008881036
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent research on Reasoning of Large Language Models (LLMs) has sought to further enhance their performance by integrating meta-thinking -- enabling models to monitor, evaluate, and control their reasoning processes for more adaptive and effective problem-solving. However, current single-agent work lacks a specialized design for acquiring meta-thinking, resulting in low efficacy. To address this challenge, we introduce Reinforced Meta-thinking Agents (ReMA), a novel framework that leverages Multi-Agent Reinforcement Learning (MARL) to elicit meta-thinking behaviors, encouraging LLMs to think about thinking. ReMA decouples the reasoning process into two hierarchical agents: a high-level meta-thinking agent responsible for generating strategic oversight and plans, and a low-level reasoning agent for detailed executions. Through iterative reinforcement learning with aligned objectives, these agents explore and learn collaboration, leading to improved generalization and robustness. Experimental results demonstrate that ReMA outperforms single-agent RL baselines on complex reasoning tasks, including competitive-level mathematical benchmarks and LLM-as-a-Judge benchmarks. Comprehensive ablation studies further illustrate the evolving dynamics of each distinct agent, providing valuable insights into how the meta-thinking reasoning process enhances the reasoning capabilities of LLMs.

Related papers

MAGE: Meta-Reinforcement Learning for Language Agents toward Strategic Exploration and Exploitation [11.222602737031101]
We propose MAGE, a meta-RL framework that empowers LLM agents for strategic exploration and exploitation.<n>MAGE utilizes a multi-episode training regime where interaction histories and reflections are integrated into the context window.<n>Experiment results show that MAGE outperforms existing baselines in both exploration and exploitation tasks.
arXiv Detail & Related papers (2026-03-04T03:14:37Z)
rSIM: Incentivizing Reasoning Capabilities of LLMs via Reinforced Strategy Injection [49.74493901036598]
Large language models (LLMs) are post-trained through reinforcement learning (RL) to evolve into Reasoning Language Models (RLMs)<n>This paper proposes a novel reinforced strategy injection mechanism (rSIM) that enables any LLM to become an RLM by employing a small planner.<n> Experimental results show that rSIM enables Qwen2.5-0.5B to become an RLM and significantly outperform Qwen2.5-14B.
arXiv Detail & Related papers (2025-12-09T06:55:39Z)
DRAFT-RL: Multi-Agent Chain-of-Draft Reasoning for Reinforcement Learning-Enhanced LLMs [8.532777609640268]
DRAFT-RL is a novel framework that integrates Chain-of-Draft (CoD) reasoning into multi-agent RL training.<n>We evaluate our method on complex reasoning tasks including code synthesis, symbolic math, and knowledge-intensive QA.
arXiv Detail & Related papers (2025-11-25T16:33:42Z)
AgentPRM: Process Reward Models for LLM Agents via Step-Wise Promise and Progress [71.02263260394261]
Large language models (LLMs) still encounter challenges in multi-turn decision-making tasks.<n>We build process reward models (PRMs) to evaluate each decision and guide the agent's decision-making process.<n>AgentPRM captures both the interdependence between sequential decisions and their contribution to the final goal.
arXiv Detail & Related papers (2025-11-11T14:57:54Z)
Multi-Agent Evolve: LLM Self-Improve through Co-evolution [53.00458074754831]
Reinforcement Learning (RL) has demonstrated significant potential in enhancing the reasoning capabilities of large language models (LLMs)<n>Recent Self-Play RL methods, inspired by the success of the paradigm in games and Go, aim to enhance LLM reasoning capabilities without human-annotated data.<n>We propose Multi-Agent Evolve (MAE), a framework that enables LLMs to self-evolve in solving diverse tasks, including mathematics, reasoning, and general knowledge Q&A.
arXiv Detail & Related papers (2025-10-27T17:58:02Z)
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey [103.32591749156416]
The emergence of agentic reinforcement learning (Agentic RL) marks a paradigm shift from conventional reinforcement learning applied to large language models (LLM RL)<n>This survey formalizes this conceptual shift by contrasting the degenerate single-step Markov Decision Processes (MDPs) of LLM-RL with the temporally extended, partially observable Markov decision processes (POMDPs) that define Agentic RL.
arXiv Detail & Related papers (2025-09-02T17:46:26Z)
Reinforcing Multi-Turn Reasoning in LLM Agents via Turn-Level Credit Assignment [29.617927643991877]
This paper investigates approaches to enhance the reasoning capabilities of Large Language Model (LLM) agents using Reinforcement Learning (RL)<n>We introduce a fine-grained turn-level advantage estimation strategy to enable more precise credit assignment in multi-turn agent interactions.<n>Our method achieves 100% success in tool execution and 50% accuracy in exact answer matching, significantly outperforming baselines.
arXiv Detail & Related papers (2025-05-17T04:09:46Z)
Meta-Thinking in LLMs via Multi-Agent Reinforcement Learning: A Survey [2.572335031488049]
This survey explores the development of meta-thinking capabilities in Large Language Models (LLMs) from a Multi-Agent Reinforcement Learning (MARL) perspective. By exploring reward mechanisms, self-play, and continuous learning methods in MARL, this survey gives a comprehensive roadmap to building introspective, adaptive, and trustworthy LLMs.
arXiv Detail & Related papers (2025-04-20T07:34:26Z)
Large Reasoning Models in Agent Scenarios: Exploring the Necessity of Reasoning Capabilities [74.35956310688164]
We propose the LaRMA framework, encompassing nine tasks across Tool Usage, Plan Design, and Problem Solving. Our findings address four research questions: LRMs surpass LLMs in reasoning-intensive tasks like Plan Design, leveraging iterative reflection for superior outcomes. LRMs' enhanced reasoning incurs higher computational costs, prolonged processing, and behavioral challenges, including overthinking and fact-ignoring tendencies.
arXiv Detail & Related papers (2025-03-14T04:34:31Z)
Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search [57.28671084993782]
Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains.<n>Recent studies have shown that increasing test-time computation enhances LLMs' reasoning capabilities.<n>We propose a two-stage training paradigm: 1) a small-scale format tuning stage to internalize the COAT reasoning format and 2) a large-scale self-improvement stage leveraging reinforcement learning.
arXiv Detail & Related papers (2025-02-04T17:26:58Z)
Enhancing LLM Reasoning with Multi-Path Collaborative Reactive and Reflection agents [26.645038049346255]
We propose the Reactive and Reflection agents with Multi-Path Reasoning (RR-MP) Framework.<n>Our approach improves scientific reasoning accuracy by employing a multi-path reasoning mechanism.<n>We conducted zero-shot and few-shot evaluations on tasks involving moral scenarios, college-level physics, and mathematics.
arXiv Detail & Related papers (2024-12-31T13:11:20Z)
From Novice to Expert: LLM Agent Policy Optimization via Step-wise Reinforcement Learning [62.54484062185869]
We introduce StepAgent, which utilizes step-wise reward to optimize the agent's reinforcement learning process.<n>We propose implicit-reward and inverse reinforcement learning techniques to facilitate agent reflection and policy adjustment.
arXiv Detail & Related papers (2024-11-06T10:35:11Z)
Textualized Agent-Style Reasoning for Complex Tasks by Multiple Round LLM Generation [49.27250832754313]
We present AgentCOT, a llm-based autonomous agent framework. At each step, AgentCOT selects an action and executes it to yield an intermediate result with supporting evidence. We introduce two new strategies to enhance the performance of AgentCOT.
arXiv Detail & Related papers (2024-09-19T02:20:06Z)
Hypothetical Minds: Scaffolding Theory of Mind for Multi-Agent Tasks with Large Language Models [4.9108308035618515]
Multi-agent reinforcement learning (MARL) methods struggle with the non-stationarity of multi-agent systems. Here, we leverage large language models (LLMs) to create an autonomous agent that can handle these challenges. Our agent, Hypothetical Minds, consists of a cognitively-inspired architecture, featuring modular components for perception, memory, and hierarchical planning over two levels of abstraction.
arXiv Detail & Related papers (2024-07-09T17:57:15Z)
MMCTAgent: Multi-modal Critical Thinking Agent Framework for Complex Visual Reasoning [3.651416979200174]
MMCTAgent is a novel critical thinking agent framework designed to address the inherent limitations of current MLLMs in complex visual reasoning tasks. Inspired by human cognitive processes and critical thinking, MMCTAgent iteratively analyzes multi-modal information, decomposes queries, plans strategies, and dynamically evolves its reasoning.
arXiv Detail & Related papers (2024-05-28T16:55:41Z)
Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration [70.09561665520043]
We propose a novel framework for multi-agent collaboration that introduces Reinforced Advantage feedback (ReAd) for efficient self-refinement of plans. We provide theoretical analysis by extending advantage-weighted regression in reinforcement learning to multi-agent systems. Experiments on Over-AI and a difficult variant of RoCoBench show that ReAd surpasses baselines in success rate, and also significantly decreases the interaction steps of agents.
arXiv Detail & Related papers (2024-05-23T08:33:19Z)
AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents [74.16170899755281]
We introduce AgentBoard, a pioneering comprehensive benchmark and accompanied open-source evaluation framework tailored to analytical evaluation of LLM agents.<n>AgentBoard offers a fine-grained progress rate metric that captures incremental advancements as well as a comprehensive evaluation toolkit.<n>This not only sheds light on the capabilities and limitations of LLM agents but also propels the interpretability of their performance to the forefront.
arXiv Detail & Related papers (2024-01-24T01:51:00Z)
Theory of Mind for Multi-Agent Collaboration via Large Language Models [5.2767999863286645]
This study evaluates Large Language Models (LLMs)-based agents in a multi-agent cooperative text game with Theory of Mind (ToM) inference tasks. We observed evidence of emergent collaborative behaviors and high-order Theory of Mind capabilities among LLM-based agents.
arXiv Detail & Related papers (2023-10-16T07:51:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.