From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents
- URL: http://arxiv.org/abs/2601.22607v2
- Date: Mon, 02 Feb 2026 23:32:08 GMT
- Title: From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents
- Authors: Jiaxuan Gao, Jiaao Chen, Chuyi He, Wei-Chen Wang, Shusheng Xu, Hanrui Wang, Di Jin, Yi Wu
- Abstract summary: EigenData is a hierarchical multi-agent engine that synthesizes tool-grounded dialogues together with executable per-instance checkers. Building on the synthetic data, we develop an RL recipe that first fine-tunes the user model and then applies GRPO-style training. Our results suggest a scalable pathway for bootstrapping complex tool-using behaviors without expensive human annotation.
- Score: 23.583947864141162
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Interactive tool-using agents must solve real-world tasks via multi-turn interaction with both humans and external environments, which requires dialogue state tracking, multi-step tool execution, and adherence to complex instructions. Post-training such agents is challenging because synthesizing high-quality multi-turn tool-use data is difficult to scale, and reinforcement learning (RL) can suffer from noisy signals caused by user simulation, which degrades training efficiency. We propose a unified framework that combines a self-evolving data agent with verifier-based RL. Our system, EigenData, is a hierarchical multi-agent engine that synthesizes tool-grounded dialogues together with executable per-instance checkers, and improves generation reliability via a closed-loop self-evolving process that updates prompts and workflow. Building on the synthetic data, we develop an RL recipe that first fine-tunes the user model and then applies GRPO-style training with trajectory-level group-relative advantages and dynamic filtering, yielding consistent improvements beyond SFT. Evaluated on $\tau^2$-bench, our best model reaches 73.0% pass^1 on Airline and 98.3% pass^1 on Telecom, matching or exceeding frontier models. Overall, our results suggest a scalable pathway for bootstrapping complex tool-using behaviors without expensive human annotation.
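The abstract names trajectory-level group-relative advantages and dynamic filtering without giving formulas. The sketch below illustrates the commonly used form of both ideas, not the authors' implementation: each trajectory's reward is normalized by the mean and standard deviation of its own rollout group, and groups whose rewards are all identical are dropped because they carry no learning signal. The binary checker rewards, the task names, the `eps` constant, and both function names are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each trajectory reward against its own rollout group.

    One scalar advantage per trajectory (trajectory-level), computed as
    (reward - group mean) / (group std + eps). This is an assumed,
    commonly used GRPO-style formula, not taken from the paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dynamic_filter(groups):
    """Keep only groups with at least two distinct reward values.

    A group where every rollout passed (or every rollout failed) yields
    zero advantage for all trajectories, so it is dropped here.
    """
    return {task: rs for task, rs in groups.items() if len(set(rs)) > 1}


# Hypothetical rollout groups: 1.0 = the per-instance checker passed.
groups = {
    "task_a": [1.0, 0.0, 0.0, 1.0],  # mixed outcomes, kept
    "task_b": [1.0, 1.0, 1.0, 1.0],  # all passed, filtered out
    "task_c": [0.0, 0.0, 0.0, 0.0],  # all failed, filtered out
}

kept = dynamic_filter(groups)
advantages = {task: group_relative_advantages(rs) for task, rs in kept.items()}
```

In this toy input only `task_a` survives filtering, and its passing trajectories receive positive advantages while the failing ones receive negative advantages of equal magnitude.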
Related papers
- ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas [13.919124676472022]
ASTRA is an end-to-end framework for training tool-augmented language model agents. ASTRA integrates scalable data synthesis and verifiable reinforcement learning. Experiments on multiple agentic tool-use benchmarks demonstrate that ASTRA-trained models achieve state-of-the-art performance.
arXiv Detail & Related papers (2026-01-29T11:22:23Z)
- Unlocking Implicit Experience: Synthesizing Tool-Use Trajectories from Text [48.25052564552558]
We introduce GEM, a data synthesis pipeline that enables the generation and extraction of multi-turn tool-use trajectories from text corpora. To reduce the computational cost, we further train a specialized Trajectory Synthesizer via supervised fine-tuning. Experiments demonstrate that our GEM-32B achieves a 16.5% improvement on the BFCL V3 Multi-turn benchmark.
arXiv Detail & Related papers (2026-01-15T12:58:46Z)
- Close the Loop: Synthesizing Infinite Tool-Use Data via Multi-Agent Role-Playing [16.839489120513505]
InfTool orchestrates three collaborative agents to generate diverse, verified trajectories spanning single-turn calls to complex multi-step gated calls. We show that InfTool transforms a base 32B model from 19.8% to 70.9% accuracy (+258%), surpassing models 10x larger and rivaling Claude-Opus.
arXiv Detail & Related papers (2025-12-29T17:12:39Z)
- Agent2World: Learning to Generate Symbolic World Models via Adaptive Multi-Agent Feedback [51.22403664895878]
Agent2World is a tool-augmented multi-agent framework that achieves strong inference-time world-model generation. It also serves as a data engine for supervised fine-tuning, by grounding generation in multi-agent feedback.
arXiv Detail & Related papers (2025-12-26T18:54:14Z)
- ToolMind Technical Report: A Large-Scale, Reasoning-Enhanced Tool-Use Dataset [43.45582911794623]
We introduce ToolMind, a high-quality tool-agentic dataset with 160k synthetic data instances. We employ fine-grained turn-level filtering to remove erroneous or suboptimal steps. Models fine-tuned on ToolMind show significant improvements over baselines on several benchmarks.
arXiv Detail & Related papers (2025-11-12T13:01:23Z)
- Scaling Agent Learning via Experience Synthesis [100.42712232390532]
Reinforcement learning can empower autonomous agents by enabling self-improvement through interaction. But its practical adoption remains challenging due to costly rollouts, limited task diversity, unreliable reward signals, and infrastructure complexity. We introduce DreamGym, the first unified framework designed to synthesize diverse experiences with scalability in mind.
arXiv Detail & Related papers (2025-11-05T18:58:48Z)
- FunReason-MT Technical Report: Overcoming the Complexity Barrier in Multi-Turn Function Calling [39.45732462111156]
We present FunReason-MT, a novel data synthesis framework for real-world multi-turn tool use. FunReason-MT resolves the complexity barrier in multi-turn FC data by employing Environment-API Graph Interactions. A 4B model built upon FunReason-MT generated data achieves state-of-the-art performance among comparable-sized models.
arXiv Detail & Related papers (2025-10-28T17:15:26Z)
- Synthesizing Agentic Data for Web Agents with Progressive Difficulty Enhancement Mechanisms [81.90219895125178]
Web-based 'deep research' agents aim to solve complex question-answering tasks through long-horizon interactions with online tools. These tasks remain challenging, as the underlying language models are often not optimized for long-horizon reasoning. We introduce a two-pronged data synthesis pipeline that generates question-answer pairs by progressively increasing complexity.
arXiv Detail & Related papers (2025-10-15T06:34:46Z)
- Efficient Multi-turn RL for GUI Agents via Decoupled Training and Adaptive Data Curation [65.3648667980258]
Vision-language model (VLM) based GUI agents show promise for automating complex tasks, but face significant challenges in applying reinforcement learning (RL). We propose DART, a Decoupled Agentic RL Training framework for GUI agents, which coordinates heterogeneous modules in a highly decoupled manner. On the OSWorld benchmark, DART-GUI-7B achieves a 42.13% task success rate, a 14.61% absolute gain over the base model, and 7.34% higher than open-source SOTA.
arXiv Detail & Related papers (2025-09-28T13:19:20Z)
- LAM SIMULATOR: Advancing Data Generation for Large Action Model Training via Online Exploration and Trajectory Feedback [121.78866929908871]
Large Action Models (LAMs) for AI Agents offer incredible potential but face challenges due to the need for high-quality training data. We present LAM SIMULATOR, a comprehensive framework designed for online exploration of agentic tasks with high-quality feedback. Our framework features a dynamic task query generator, an extensive collection of tools, and an interactive environment where Large Language Model (LLM) Agents can call tools and receive real-time feedback.
arXiv Detail & Related papers (2025-06-02T22:36:02Z)
- APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay [86.01901238059261]
APIGen-MT is a framework that generates verifiable and diverse multi-turn agent data. We train a family of models, the xLAM-2-fc-r series, with sizes ranging from 1B to 70B parameters. Our models outperform frontier models such as GPT-4o and Claude 3.5 on $\tau$-bench and BFCL benchmarks.
arXiv Detail & Related papers (2025-04-04T17:13:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.