Related papers: From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents

From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents

URL: http://arxiv.org/abs/2601.22607v2
Date: Mon, 02 Feb 2026 23:32:08 GMT
Title: From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents
Authors: Jiaxuan Gao, Jiaao Chen, Chuyi He, Wei-Chen Wang, Shusheng Xu, Hanrui Wang, Di Jin, Yi Wu,
Abstract summary: EigenData is a hierarchical multi-agent engine that synthesizes tool-grounded dialogues together with executable per-instance checkers.<n>Building on the synthetic data, we develop an RL recipe that first fine-tunes the user model and then applies GRPO-style training.<n>Our results suggest a scalable pathway for bootstrapping complex tool-using behaviors without expensive human annotation.
Score: 23.583947864141162
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Interactive tool-using agents must solve real-world tasks via multi-turn interaction with both humans and external environments, requiring dialogue state tracking, multi-step tool execution, while following complex instructions. Post-training such agents is challenging because synthesis for high-quality multi-turn tool-use data is difficult to scale, and reinforcement learning (RL) could face noisy signals caused by user simulation, leading to degraded training efficiency. We propose a unified framework that combines a self-evolving data agent with verifier-based RL. Our system, EigenData, is a hierarchical multi-agent engine that synthesizes tool-grounded dialogues together with executable per-instance checkers, and improves generation reliability via closed-loop self-evolving process that updates prompts and workflow. Building on the synthetic data, we develop an RL recipe that first fine-tunes the user model and then applies GRPO-style training with trajectory-level group-relative advantages and dynamic filtering, yielding consistent improvements beyond SFT. Evaluated on tau^2-bench, our best model reaches 73.0% pass^1 on Airline and 98.3% pass^1 on Telecom, matching or exceeding frontier models. Overall, our results suggest a scalable pathway for bootstrapping complex tool-using behaviors without expensive human annotation.

Related papers

ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas [13.919124676472022]
ASTRA is an end-to-end framework for training tool-augmented language model agents.<n>ASTRA integrates scalable data synthesis and verifiable reinforcement learning.<n> Experiments on multiple agentic tool-use benchmarks demonstrate that ASTRA-trained models achieve state-of-the-art performance.
arXiv Detail & Related papers (2026-01-29T11:22:23Z)
Unlocking Implicit Experience: Synthesizing Tool-Use Trajectories from Text [48.25052564552558]
We introduce GEM, a data synthesis pipeline that enables the generation and extraction of multi-turn tool-use trajectories from text corpora.<n>To reduce the computational cost, we further train a specialized Trajectory Synthesizer via supervised fine-tuning.<n>Experiments demonstrate that our GEM-32B achieve a 16.5% improvement on the BFCL V3 Multi-turn benchmark.
arXiv Detail & Related papers (2026-01-15T12:58:46Z)
Close the Loop: Synthesizing Infinite Tool-Use Data via Multi-Agent Role-Playing [16.839489120513505]
InfTool orchestrates three collaborative agents to generate diverse, verified trajectories spanning single-turn calls to complex multi-step gated calls.<n>We show that InfTool transforms a base 32B model from 19.8% to 70.9% accuracy (+258%), surpassing models 10x larger and rivaling Claude-Opus.
arXiv Detail & Related papers (2025-12-29T17:12:39Z)
Agent2World: Learning to Generate Symbolic World Models via Adaptive Multi-Agent Feedback [51.22403664895878]
Agent2World is a tool-augmented multi-agent framework that achieves strong inference-time world-model generation.<n>It also serves as a data engine for supervised fine-tuning, by grounding generation in multi-agent feedback.
arXiv Detail & Related papers (2025-12-26T18:54:14Z)
ToolMind Technical Report: A Large-Scale, Reasoning-Enhanced Tool-Use Dataset [43.45582911794623]
We introduce ToolMind, a high-quality tool-agentic dataset with 160k synthetic data instances.<n>We employ fine-grained turn-level filtering to remove erroneous or suboptimal steps.<n>Models fine-tuned on ToolMind show significant improvements over baselines on several benchmarks.
arXiv Detail & Related papers (2025-11-12T13:01:23Z)
Scaling Agent Learning via Experience Synthesis [100.42712232390532]
Reinforcement learning can empower autonomous agents by enabling self-improvement through interaction.<n>But its practical adoption remains challenging due to costly rollouts, limited task diversity, unreliable reward signals, and infrastructure complexity.<n>We introduce DreamGym, the first unified framework designed to synthesize diverse experiences with scalability in mind.
arXiv Detail & Related papers (2025-11-05T18:58:48Z)
FunReason-MT Technical Report: Overcoming the Complexity Barrier in Multi-Turn Function Calling [39.45732462111156]
We present FunReason-MT, a novel data synthesis framework for real-world multi-turn tool use.<n>FunReason-MT resolves the complexity barrier in multi-turn FC data by employing Environment-API Graph Interactions.<n>A 4B model built upon FunReason-MT generated data achieves state-of-the-art performance among comparable-sized models.
arXiv Detail & Related papers (2025-10-28T17:15:26Z)
Synthesizing Agentic Data for Web Agents with Progressive Difficulty Enhancement Mechanisms [81.90219895125178]
Web-based 'deep research' agents aim to solve complex question - answering tasks through long-horizon interactions with online tools.<n>These tasks remain challenging, as the underlying language models are often not optimized for long-horizon reasoning.<n>We introduce a two-pronged data synthesis pipeline that generates question - answer pairs by progressively increasing complexity.
arXiv Detail & Related papers (2025-10-15T06:34:46Z)
Efficient Multi-turn RL for GUI Agents via Decoupled Training and Adaptive Data Curation [65.3648667980258]
Vision-language model (VLM) based GUI agents show promise for automating complex tasks, but face significant challenges in applying reinforcement learning (RL)<n>We propose DART, a Decoupled Agentic RL Training framework for GUI agents, which coordinates heterogeneous modules in a highly decoupled manner.<n>On the OSWorld benchmark, DART-GUI-7B achieves a 42.13% task success rate, a 14.61% absolute gain over the base model, and 7.34% higher than open-source SOTA.
arXiv Detail & Related papers (2025-09-28T13:19:20Z)
LAM SIMULATOR: Advancing Data Generation for Large Action Model Training via Online Exploration and Trajectory Feedback [121.78866929908871]
Large Action Models (LAMs) for AI Agents offer incredible potential but face challenges due to the need for high-quality training data.<n>We present LAM SIMULATOR, a comprehensive framework designed for online exploration of agentic tasks with high-quality feedback.<n>Our framework features a dynamic task query generator, an extensive collection of tools, and an interactive environment where Large Language Model (LLM) Agents can call tools and receive real-time feedback.
arXiv Detail & Related papers (2025-06-02T22:36:02Z)
APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay [86.01901238059261]
APIGen-MT is a framework that generates verifiable and diverse multi-turn agent data.<n>We train a family of models -- the xLAM-2-fc-r series with sizes ranging from 1B to 70B parameters.<n>Our models outperform frontier models such as GPT-4o and Claude 3.5 on $tau$-bench and BFCL benchmarks.
arXiv Detail & Related papers (2025-04-04T17:13:57Z)

This list is automatically generated from the titles and abstracts of the papers in this site.