MAXS: Meta-Adaptive Exploration with LLM Agents
- URL: http://arxiv.org/abs/2601.09259v1
- Date: Wed, 14 Jan 2026 07:48:00 GMT
- Title: MAXS: Meta-Adaptive Exploration with LLM Agents
- Authors: Jian Zhang, Zhiyuan Wang, Zhangqi Wang, Yu He, Haoran Luo, li yuan, Lingling Zhang, Rui Mao, Qika Lin, Jun Liu,
- Abstract summary: MaxS is a meta-adaptive reasoning framework based on Large Language Model (LLM) Agents.<n>MAXS employs a lookahead strategy to extend reasoning paths a few steps ahead.<n>It combines step consistency variance and inter-step trend slopes to jointly select stable, consistent, and high-value reasoning steps.
- Score: 48.04723638253802
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Model (LLM) Agents exhibit inherent reasoning abilities through the collaboration of multiple tools. However, during agent inference, existing methods often suffer from (i) locally myopic generation, due to the absence of lookahead, and (ii) trajectory instability, where minor early errors can escalate into divergent reasoning paths. These issues make it difficult to balance global effectiveness and computational efficiency. To address these two issues, we propose meta-adaptive exploration with LLM agents https://github.com/exoskeletonzj/MAXS, a meta-adaptive reasoning framework based on LLM Agents that flexibly integrates tool execution and reasoning planning. MAXS employs a lookahead strategy to extend reasoning paths a few steps ahead, estimating the advantage value of tool usage, and combines step consistency variance and inter-step trend slopes to jointly select stable, consistent, and high-value reasoning steps. Additionally, we introduce a trajectory convergence mechanism that controls computational cost by halting further rollouts once path consistency is achieved, enabling a balance between resource efficiency and global effectiveness in multi-tool reasoning. We conduct extensive empirical studies across three base models (MiMo-VL-7B, Qwen2.5-VL-7B, Qwen2.5-VL-32B) and five datasets, demonstrating that MAXS consistently outperforms existing methods in both performance and inference efficiency. Further analysis confirms the effectiveness of our lookahead strategy and tool usage.
Related papers
- Reasoning and Tool-use Compete in Agentic RL:From Quantifying Interference to Disentangled Tuning [26.401906729658688]
Agentic Reinforcement Learning (ARL) focuses on training large language models to interleave reasoning with external tool execution to solve complex tasks.<n>Most existing ARL methods train a single shared model parameters to support both reasoning and tool use behaviors, implicitly assuming that joint training leads to improved overall agent performance.<n>We show that these two capabilities often induce misaligned gradient directions, leading to training interference that undermines the effectiveness of joint optimization.<n>We propose Disentangled Action Reasoning Tuning(DART), a simple and efficient framework that explicitly decouples parameter updates for reasoning and tool-use via separate low-rank
arXiv Detail & Related papers (2026-02-01T03:19:22Z) - Demystifying Reinforcement Learning in Agentic Reasoning [90.3737088727791]
We conduct a comprehensive and systematic investigation to demystify reinforcement learning in agentic reasoning.<n>We highlight our key insights: (i) replacing stitched synthetic trajectories with real end-to-end tool-use trajectories yields a far stronger SFT.<n> Exploration-friendly techniques are crucial for agentic RL, such as clip higher, overlong reward shaping, and maintaining adequate policy entropy could improve the training efficiency.
arXiv Detail & Related papers (2025-10-13T17:57:15Z) - MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization [103.74675519953898]
Long-chain reflective reasoning is a prerequisite for solving complex real-world problems.<n>We build a benchmark consisting 1,260 samples of 42 challenging synthetic tasks.<n>We generate post-training data and explore learning paradigms for exploiting such data.
arXiv Detail & Related papers (2025-10-09T17:53:58Z) - Plan Then Action:High-Level Planning Guidance Reinforcement Learning for LLM Reasoning [22.177866778776814]
We propose a two-stage framework designed to improve both high-level planning and fine-grained Chain-of-Thought (CoT) reasoning.<n>In the first stage, we leverage advanced LLMs to distill CoT into compact high-level guidance, which is then used for supervised fine-tuning.<n>In the second stage, we introduce a guidance-aware RL method that jointly optimize the final output and the quality of high-level guidance.
arXiv Detail & Related papers (2025-10-02T09:28:13Z) - RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization [111.1749164063616]
We propose RL-PLUS, a novel hybrid-policy optimization approach for Large Language Models (LLMs)<n> RL-PLUS synergizes internal exploitation with external data to achieve stronger reasoning capabilities and surpass the boundaries of base models.<n>We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach.
arXiv Detail & Related papers (2025-07-31T23:55:29Z) - Agentic Reinforced Policy Optimization [66.96989268893932]
Large-scale reinforcement learning with verifiable rewards (RLVR) has demonstrated its effectiveness in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks.<n>Current RL algorithms inadequately balance the models' intrinsic long-horizon reasoning capabilities and their proficiency in multi-turn tool interactions.<n>We propose Agentic Reinforced Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training multi-turn LLM-based agents.
arXiv Detail & Related papers (2025-07-26T07:53:11Z) - Reasoning on a Budget: A Survey of Adaptive and Controllable Test-Time Compute in LLMs [45.83245433138508]
Large language models (LLMs) have rapidly progressed into general-purpose agents capable of solving a broad spectrum of tasks.<n>They apply fixed inference-time compute regardless of task complexity, often overthinking simple problems while underthinking hard ones.<n>This survey presents a comprehensive review of efficient test-time compute strategies, which aim to improve the computational efficiency of LLM reasoning.
arXiv Detail & Related papers (2025-07-02T18:27:42Z) - Mitigating Cross-Modal Distraction and Ensuring Geometric Feasibility via Affordance-Guided and Self-Consistent MLLMs for Task Planning in Instruction-Following Manipulation [5.903105418868711]
We introduce textbfQuARC (Quantity, Analysis, Relative positioning, Collision), a new benchmark based on a food preparation scenario.<n>We tackle two major limitations of current MLLMs: cross-modal distraction and geometric infeasibility.<n>Our method achieves a 76.7% success rate on the benchmark, significantly outperforming the ViLa baseline.
arXiv Detail & Related papers (2025-03-17T11:01:02Z) - Learning to Use Tools via Cooperative and Interactive Agents [58.77710337157665]
Tool learning empowers large language models (LLMs) as agents to use external tools and extend their utility.
We propose ConAgents, a Cooperative and interactive Agents framework, which coordinates three specialized agents for tool selection, tool execution, and action calibration separately.
Our experiments on three datasets show that the LLMs, when equipped with ConAgents, outperform baselines with substantial improvement.
arXiv Detail & Related papers (2024-03-05T15:08:16Z) - Maximize to Explore: One Objective Function Fusing Estimation, Planning,
and Exploration [87.53543137162488]
We propose an easy-to-implement online reinforcement learning (online RL) framework called textttMEX.
textttMEX integrates estimation and planning components while balancing exploration exploitation automatically.
It can outperform baselines by a stable margin in various MuJoCo environments with sparse rewards.
arXiv Detail & Related papers (2023-05-29T17:25:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.