Process-Supervised Reinforcement Learning for Interactive Multimodal Tool-Use Agents
- URL: http://arxiv.org/abs/2509.14480v1
- Date: Wed, 17 Sep 2025 23:25:00 GMT
- Title: Process-Supervised Reinforcement Learning for Interactive Multimodal Tool-Use Agents
- Authors: Weiting Tan, Xinghua Qu, Ming Tu, Meng Ge, Andy T. Liu, Philipp Koehn, Lu Lu,
- Abstract summary: We introduce a sandbox environment for reinforcement learning (RL) that supports interleaved speech-text rollouts. Our core strategy, Turn-level Adjudicated Reinforcement Learning (TARL), addresses the challenge of credit assignment in long-horizon tasks. This unified approach boosts the task pass rate on the text-based $\tau$-bench by over 6% compared to strong RL baselines.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Effective interactive tool use requires agents to master Tool Integrated Reasoning (TIR): a complex process involving multi-turn planning and long-context dialogue management. To train agents for this dynamic process, particularly in multi-modal contexts, we introduce a sandbox environment for reinforcement learning (RL) that supports interleaved speech-text rollouts. Our core strategy, Turn-level Adjudicated Reinforcement Learning (TARL), addresses the challenge of credit assignment in long-horizon tasks by employing a Large Language Model (LLM) as a judge to provide turn-level evaluation. To enhance exploration, we integrate a mixed-task training curriculum with mathematical reasoning problems. This unified approach boosts the task pass rate on the text-based $\tau$-bench by over 6% compared to strong RL baselines. Crucially, we demonstrate our framework's suitability for fine-tuning a multi-modal foundation model for agentic tasks. By training a base multi-modal LLM on interleaved speech-text rollouts, we equip it with tool-use abilities, paving the way for more natural, voice-driven interactive agents.
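The abstract's core idea, using an LLM judge to grade individual turns so that credit in a long rollout is not assigned solely by the final outcome, can be illustrated with a minimal sketch. All names here (`Turn`, `adjudicate_rollout`, the stub judge, and the weighted combination of judge score and shared outcome reward) are illustrative assumptions, not the paper's actual formulation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    """One agent turn in a tool-use rollout (fields are illustrative)."""
    agent_message: str
    tool_result: str


def adjudicate_rollout(
    turns: List[Turn],
    outcome_reward: float,
    judge: Callable[[Turn], float],
    turn_weight: float = 0.5,
) -> List[float]:
    """Assign a per-turn reward: a judge score for each turn blended with
    the final task outcome shared evenly across turns. This is one simple
    way to realize turn-level credit assignment, not TARL's exact scheme."""
    shared = outcome_reward / max(len(turns), 1)
    return [turn_weight * judge(t) + (1.0 - turn_weight) * shared for t in turns]


def stub_judge(turn: Turn) -> float:
    """Stand-in for an LLM grader: penalize turns whose tool call errored."""
    return 0.0 if "error" in turn.tool_result.lower() else 1.0


rollout = [
    Turn("search(flights)", "ok: 3 results"),
    Turn("book(flight_7)", "Error: seat unavailable"),
    Turn("book(flight_2)", "ok: booked"),
]
rewards = adjudicate_rollout(rollout, outcome_reward=1.0, judge=stub_judge)
```

Under this toy scheme the failed second turn receives a lower reward than the successful first and third turns, even though all three share the same task outcome, which is precisely the signal a pure outcome-level reward cannot provide.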
Related papers
- Reinforcing Real-world Service Agents: Balancing Utility and Cost in Task-oriented Dialogue [28.25180116201176]
We propose InteractCS-RL, a framework that reframes task-oriented dialogue as a multi-granularity reinforcement learning process. We first establish a User-centric Interaction Framework to provide a high-fidelity training gym. Then, we introduce Cost-aware Multi-turn Policy Optimization (CMPO) with a hybrid advantage estimation strategy.
arXiv Detail & Related papers (2026-02-26T07:19:57Z) - Multi-Agent Tool-Integrated Policy Optimization [67.12841355267678]
Large language models (LLMs) increasingly rely on multi-turn tool-integrated planning for knowledge-intensive and complex reasoning tasks. Existing implementations typically rely on a single agent, but they suffer from limited context length and noisy tool responses. No existing methods support effective reinforcement learning post-training of tool-integrated multi-agent frameworks.
arXiv Detail & Related papers (2025-10-06T10:44:04Z) - Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning [0.21845291030915975]
ARTIST is a unified framework that tightly couples agentic reasoning, reinforcement learning, and tool integration for large language models. It enables models to autonomously decide when, how, and which tools to invoke within multi-turn reasoning chains. Experiments show that ARTIST consistently outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2025-04-28T10:42:49Z) - LLMs Can Evolve Continually on Modality for X-Modal Reasoning [62.2874638875554]
Existing methods rely heavily on modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities.
We propose PathWeave, a flexible and scalable framework with modal-Path sWitching and ExpAnsion abilities.
PathWeave performs comparably to state-of-the-art MLLMs while concurrently reducing parameter training burdens by 98.73%.
arXiv Detail & Related papers (2024-10-26T13:19:57Z) - Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning [79.38140606606126]
We propose an algorithmic framework that fine-tunes vision-language models (VLMs) with reinforcement learning (RL)
Our framework provides a task description and then prompts the VLM to generate chain-of-thought (CoT) reasoning.
We demonstrate that our proposed framework enhances the decision-making capabilities of VLM agents across various tasks.
arXiv Detail & Related papers (2024-05-16T17:50:19Z) - LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models [56.25156596019168]
This paper introduces the LMRL-Gym benchmark for evaluating multi-turn RL for large language models (LLMs).
Our benchmark consists of 8 different language tasks, which require multiple rounds of language interaction and cover a range of tasks in open-ended dialogue and text games.
arXiv Detail & Related papers (2023-11-30T03:59:31Z) - Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning [49.92517970237088]
We tackle the problem of training a robot to understand multimodal prompts.
This type of task poses a major challenge to robots' capability to understand the interconnection and complementarity between vision and language signals.
We introduce an effective framework that learns a policy to perform robot manipulation with multimodal prompts.
arXiv Detail & Related papers (2023-10-14T22:24:58Z) - Self-Explanation Prompting Improves Dialogue Understanding in Large Language Models [52.24756457516834]
We propose a novel "Self-Explanation" prompting strategy to enhance the comprehension abilities of Large Language Models (LLMs).
This task-agnostic approach requires the model to analyze each dialogue utterance before task execution, thereby improving performance across various dialogue-centric tasks.
Experimental results from six benchmark datasets confirm that our method consistently outperforms other zero-shot prompts and matches or exceeds the efficacy of few-shot prompts.
arXiv Detail & Related papers (2023-09-22T15:41:34Z) - High-Quality Diversification for Task-Oriented Dialogue Systems [18.455916009255485]
Training DRL agents with diverse dialogue trajectories prepares them well for rare user requests and unseen situations.
One effective diversification method is to let the agent interact with a diverse set of learned user models.
We propose a novel dialogue diversification method for task-oriented dialogue systems trained in simulators.
arXiv Detail & Related papers (2021-06-02T02:10:07Z)