Process-Supervised Reinforcement Learning for Interactive Multimodal Tool-Use Agents
- URL: http://arxiv.org/abs/2509.14480v1
- Date: Wed, 17 Sep 2025 23:25:00 GMT
- Title: Process-Supervised Reinforcement Learning for Interactive Multimodal Tool-Use Agents
- Authors: Weiting Tan, Xinghua Qu, Ming Tu, Meng Ge, Andy T. Liu, Philipp Koehn, Lu Lu,
- Abstract summary: We introduce a sandbox environment for reinforcement learning (RL) that supports interleaved speech-text rollouts. Our core strategy, Turn-level Adjudicated Reinforcement Learning (TARL), addresses the challenge of credit assignment in long-horizon tasks. This unified approach boosts the task pass rate on the text-based $\tau$-bench by over 6% compared to strong RL baselines.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Effective interactive tool use requires agents to master Tool Integrated Reasoning (TIR): a complex process involving multi-turn planning and long-context dialogue management. To train agents for this dynamic process, particularly in multi-modal contexts, we introduce a sandbox environment for reinforcement learning (RL) that supports interleaved speech-text rollouts. Our core strategy, Turn-level Adjudicated Reinforcement Learning (TARL), addresses the challenge of credit assignment in long-horizon tasks by employing a Large Language Model (LLM) as a judge to provide turn-level evaluation. To enhance exploration, we integrate a mixed-task training curriculum with mathematical reasoning problems. This unified approach boosts the task pass rate on the text-based $\tau$-bench by over 6% compared to strong RL baselines. Crucially, we demonstrate our framework's suitability for fine-tuning a multi-modal foundation model for agentic tasks. By training a base multi-modal LLM on interleaved speech-text rollouts, we equip it with tool-use abilities, paving the way for more natural, voice-driven interactive agents.
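The abstract's core idea, using an LLM judge to grade individual turns so that credit in a long rollout is not assigned solely by the final outcome, can be illustrated with a minimal sketch. All names here (`Turn`, `adjudicate_rollout`, the stub judge, and the weighted combination of judge score and shared outcome reward) are illustrative assumptions, not the paper's actual formulation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    """One agent turn in a tool-use rollout (fields are illustrative)."""
    agent_message: str
    tool_result: str


def adjudicate_rollout(
    turns: List[Turn],
    outcome_reward: float,
    judge: Callable[[Turn], float],
    turn_weight: float = 0.5,
) -> List[float]:
    """Assign a per-turn reward: a judge score for each turn blended with
    the final task outcome shared evenly across turns. This is one simple
    way to realize turn-level credit assignment, not TARL's exact scheme."""
    shared = outcome_reward / max(len(turns), 1)
    return [turn_weight * judge(t) + (1.0 - turn_weight) * shared for t in turns]


def stub_judge(turn: Turn) -> float:
    """Stand-in for an LLM grader: penalize turns whose tool call errored."""
    return 0.0 if "error" in turn.tool_result.lower() else 1.0


rollout = [
    Turn("search(flights)", "ok: 3 results"),
    Turn("book(flight_7)", "Error: seat unavailable"),
    Turn("book(flight_2)", "ok: booked"),
]
rewards = adjudicate_rollout(rollout, outcome_reward=1.0, judge=stub_judge)
```

Under this toy scheme the failed second turn receives a lower reward than the successful first and third turns, even though all three share the same task outcome, which is precisely the signal a pure outcome-level reward cannot provide.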
Related papers
- Reinforcing Real-world Service Agents: Balancing Utility and Cost in Task-oriented Dialogue [28.25180116201176]
We propose InteractCS-RL, a framework that reframes task-oriented dialogue as a multi-granularity reinforcement learning process. We first establish a User-centric Interaction Framework to provide a high-fidelity training gym. Then, we introduce Cost-aware Multi-turn Policy Optimization (CMPO) with a hybrid advantage estimation strategy.
arXiv Detail & Related papers (2026-02-26T07:19:57Z) - Multi-Agent Tool-Integrated Policy Optimization [67.12841355267678]
Large language models (LLMs) increasingly rely on multi-turn tool-integrated planning for knowledge-intensive and complex reasoning tasks. Existing implementations typically rely on a single agent, but they suffer from limited context length and noisy tool responses. No existing methods support effective reinforcement learning post-training of tool-integrated multi-agent frameworks.
arXiv Detail & Related papers (2025-10-06T10:44:04Z) - Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning [0.21845291030915975]
ARTIST is a unified framework that tightly couples agentic reasoning, reinforcement learning, and tool integration for large language models. It enables models to autonomously decide when, how, and which tools to invoke within multi-turn reasoning chains. Experiments show that ARTIST consistently outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2025-04-28T10:42:49Z) - LLMs Can Evolve Continually on Modality for X-Modal Reasoning [62.2874638875554]
Existing methods rely heavily on modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities.
We propose PathWeave, a flexible and scalable framework with modal-Path sWitching and ExpAnsion abilities.
PathWeave performs comparably to state-of-the-art MLLMs while concurrently reducing parameter training burdens by 98.73%.
arXiv Detail & Related papers (2024-10-26T13:19:57Z) - Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning [79.38140606606126]
We propose an algorithmic framework that fine-tunes vision-language models (VLMs) with reinforcement learning (RL)
Our framework provides a task description and then prompts the VLM to generate chain-of-thought (CoT) reasoning.
We demonstrate that our proposed framework enhances the decision-making capabilities of VLM agents across various tasks.
arXiv Detail & Related papers (2024-05-16T17:50:19Z) - LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models [56.25156596019168]
This paper introduces the LMRL-Gym benchmark for evaluating multi-turn RL for large language models (LLMs).
Our benchmark consists of 8 different language tasks, which require multiple rounds of language interaction and cover a range of tasks in open-ended dialogue and text games.
arXiv Detail & Related papers (2023-11-30T03:59:31Z) - Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning [49.92517970237088]
We tackle the problem of training a robot to understand multimodal prompts.
This type of task poses a major challenge to robots' capability to understand the interconnection and complementarity between vision and language signals.
We introduce an effective framework that learns a policy to perform robot manipulation with multimodal prompts.
arXiv Detail & Related papers (2023-10-14T22:24:58Z) - Self-Explanation Prompting Improves Dialogue Understanding in Large Language Models [52.24756457516834]
We propose a novel "Self-Explanation" prompting strategy to enhance the comprehension abilities of Large Language Models (LLMs).
This task-agnostic approach requires the model to analyze each dialogue utterance before task execution, thereby improving performance across various dialogue-centric tasks.
Experimental results from six benchmark datasets confirm that our method consistently outperforms other zero-shot prompts and matches or exceeds the efficacy of few-shot prompts.
arXiv Detail & Related papers (2023-09-22T15:41:34Z) - High-Quality Diversification for Task-Oriented Dialogue Systems [18.455916009255485]
Training DRL agents with diverse dialogue trajectories prepares them well for rare user requests and unseen situations.
One effective diversification method is to let the agent interact with a diverse set of learned user models.
We propose a novel dialogue diversification method for task-oriented dialogue systems trained in simulators.
arXiv Detail & Related papers (2021-06-02T02:10:07Z)