From Failure to Mastery: Generating Hard Samples for Tool-use Agents
- URL: http://arxiv.org/abs/2601.01498v1
- Date: Sun, 04 Jan 2026 11:56:33 GMT
- Title: From Failure to Mastery: Generating Hard Samples for Tool-use Agents
- Authors: Bingguang Hao, Zengzhuang Xu, Yuntao Wen, Xinyi Xu, Yang Liu, Tong Zhao, Maolin Wang, Long Chen, Dong Wang, Yicheng Chen, Cunyin Peng, Xiangyu Zhao, Chenyi Zhuang, Ji Zhang
- Abstract summary: HardGen is an automatic agentic pipeline designed to generate hard tool-use training samples with verifiable reasoning. The advanced tools and hard queries enable the generation of verifiable complex Chain-of-Thought (CoT). Our code, models, and dataset will be open-sourced to facilitate future research.
- Score: 40.331752086107265
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The advancement of LLM agents with tool-use capabilities requires diverse and complex training corpora. Existing data generation methods, which predominantly follow a paradigm of random sampling and shallow generation, often yield simple and homogeneous trajectories that fail to capture complex, implicit logical dependencies. To bridge this gap, we introduce HardGen, an automatic agentic pipeline designed to generate hard tool-use training samples with verifiable reasoning. First, HardGen establishes a dynamic API Graph built upon agent failure cases, from which it samples to synthesize hard traces. Second, these traces serve as conditional priors to guide the instantiation of modular, abstract advanced tools, which are subsequently leveraged to formulate hard queries. Finally, the advanced tools and hard queries enable the generation of verifiable complex Chain-of-Thought (CoT) reasoning, with closed-loop evaluation feedback steering continuous refinement of the process. Extensive evaluations demonstrate that a 4B-parameter model trained with our curated dataset achieves superior performance compared to several leading open-source and closed-source competitors (e.g., GPT-5.2, Gemini-3-Pro, and Claude-Opus-4.5). Our code, models, and dataset will be open-sourced to facilitate future research.
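The abstract's three-stage loop lends itself to a compact sketch. The following is a minimal, hypothetical rendering of that loop; every name here (APIGraph, hardgen_step, the llm and evaluator callables, the prompt wording) is an illustrative stand-in, not the authors' released code.

```python
import random

class APIGraph:
    """Dynamic API graph whose edge weights track observed agent failures."""
    def __init__(self):
        self.edges = {}  # (api_a, api_b) -> failure count

    def add_failures(self, api_sequence):
        # Reinforce every edge along an API call sequence that ended in failure.
        for a, b in zip(api_sequence, api_sequence[1:]):
            self.edges[(a, b)] = self.edges.get((a, b), 0) + 1

    def sample_hard_trace(self, n_edges=3):
        # Bias sampling toward failure-heavy edges to obtain a "hard" trace.
        if not self.edges:
            return []
        pairs = random.choices(list(self.edges),
                               weights=list(self.edges.values()), k=n_edges)
        return [api for pair in pairs for api in pair]

def hardgen_step(graph, llm, evaluator):
    """One closed-loop iteration: trace -> advanced tool -> hard query -> CoT."""
    trace = graph.sample_hard_trace()                                   # stage 1
    tool = llm(f"Compose these APIs into one advanced tool: {trace}")   # stage 2
    query = llm(f"Write a hard user query solvable only with: {tool}")
    cot = llm(f"Answer {query!r} with step-by-step tool calls via {tool}")  # stage 3
    passed, failing_apis = evaluator(query, cot)
    if not passed:
        graph.add_failures(failing_apis)  # evaluation feedback refines the graph
        return None
    return {"query": query, "cot": cot, "tool": tool}
```

Repeatedly calling hardgen_step and keeping only verified samples would yield the curated dataset the abstract describes.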
Related papers
- From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents [23.583947864141162]
EigenData is a hierarchical multi-agent engine that synthesizes tool-grounded dialogues together with executable per-instance checkers. Building on the synthetic data, we develop an RL recipe that first fine-tunes the user model and then applies GRPO-style training. Our results suggest a scalable pathway for bootstrapping complex tool-using behaviors without expensive human annotation.
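As a rough illustration of "executable per-instance checkers" paired with GRPO-style training, the sketch below builds one checker per synthesized dialogue and normalizes its binary rewards within a rollout group. The dialogue schema and function names are assumptions, not the paper's interface.

```python
import statistics

def make_checker(required_calls):
    """Build an executable checker bound to one synthesized instance."""
    def check(dialogue):
        # Collect every (tool_name, sorted_args) pair the agent actually made.
        made = {(c["name"], tuple(sorted(c["args"].items())))
                for turn in dialogue for c in turn.get("tool_calls", [])}
        want = {(c["name"], tuple(sorted(c["args"].items())))
                for c in required_calls}
        return float(want <= made)  # reward 1.0 iff all required calls appear
    return check

def grpo_advantages(rewards):
    # GRPO-style: score each rollout relative to its own sampled group.
    mu, sd = statistics.mean(rewards), statistics.pstdev(rewards) or 1.0
    return [(r - mu) / sd for r in rewards]
```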
arXiv Detail & Related papers (2026-01-30T06:01:23Z)
- ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas [13.919124676472022]
ASTRA is an end-to-end framework for training tool-augmented language model agents. ASTRA integrates scalable data synthesis and verifiable reinforcement learning. Experiments on multiple agentic tool-use benchmarks demonstrate that ASTRA-trained models achieve state-of-the-art performance.
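Reading "reinforcement arena" as a gym-like environment that hands out verifiable rewards for tool-calling episodes, a plausible, entirely hypothetical interface might look like this (the class and method names are not from the paper):

```python
class ToolArena:
    """Hypothetical arena: an agent episode ends with a verifiable reward."""
    def __init__(self, task, tools, verifier):
        self.task, self.tools, self.verifier = task, tools, verifier
        self.transcript = []

    def step(self, tool_name, **kwargs):
        # Execute a real tool call and log it for later verification.
        result = self.tools[tool_name](**kwargs)
        self.transcript.append((tool_name, kwargs, result))
        return result

    def finish(self, final_answer):
        # Verifiable reward: 1.0 only if the checker accepts the episode.
        ok = self.verifier(self.task, self.transcript, final_answer)
        return 1.0 if ok else 0.0
```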
arXiv Detail & Related papers (2026-01-29T11:22:23Z)
- LoopTool: Closing the Data-Training Loop for Robust LLM Tool Calls [46.34510189812439]
LoopTool is a fully automated, model-aware data evolution framework. It iteratively refines both the data and the model through three synergistic modules. Experiments show that our 8B model trained with LoopTool significantly surpasses its 32B data generator.
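The closed data-training loop can be sketched abstractly. The three callables below stand in for the paper's "three synergistic modules"; their exact roles here are assumptions.

```python
def loop_tool(model, seed_data, train, probe_failures, expand, rounds=3):
    """Model-aware data evolution: the data and the model refine each other."""
    data = list(seed_data)
    for _ in range(rounds):
        model = train(model, data)                # retrain on evolved data
        hard_cases = probe_failures(model, data)  # where does it still fail?
        data.extend(expand(hard_cases))           # grow data around failures
    return model, data
```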
arXiv Detail & Related papers (2025-11-12T09:34:39Z)
- Increasing LLM Coding Capabilities through Diverse Synthetic Coding Tasks [41.75017840131367]
Large language models (LLMs) have shown impressive promise in code generation. We present a scalable synthetic data generation pipeline that produces nearly 800k instruction-reasoning-code-test quadruplets.
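An "instruction-reasoning-code-test quadruplet" suggests a record like the following; the schema and the exec-based self-validation are illustrative guesses, not the paper's format.

```python
from dataclasses import dataclass

@dataclass
class Quadruplet:
    instruction: str  # natural-language task
    reasoning: str    # chain-of-thought explanation
    code: str         # candidate solution
    test: str         # executable check, e.g. assert statements

    def passes(self) -> bool:
        # Keep a sample only if its code survives its own test.
        scope = {}
        try:
            exec(self.code, scope)
            exec(self.test, scope)
            return True
        except Exception:
            return False

sample = Quadruplet(
    instruction="Write add(a, b) returning the sum.",
    reasoning="Addition of two numbers is the + operator.",
    code="def add(a, b):\n    return a + b",
    test="assert add(2, 3) == 5",
)
assert sample.passes()
```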
arXiv Detail & Related papers (2025-10-27T10:54:25Z)
- FABRIC: Framework for Agent-Based Realistic Intelligence Creation [3.940391073007047]
Large language models (LLMs) are increasingly deployed as agents, expected to decompose goals, invoke tools, and verify results in dynamic environments. We present a unified framework for synthesizing agentic data using only LLMs, without any human-in-the-loop supervision.
arXiv Detail & Related papers (2025-10-20T18:20:22Z)
- CoDA: Agentic Systems for Collaborative Data Visualization [57.270599188947294]
Deep research has revolutionized data analysis, yet data scientists still devote substantial time to manually crafting visualizations. Existing approaches, including simple single- or multi-agent systems, often oversimplify the task. We introduce CoDA, a multi-agent system that employs specialized LLM agents for metadata analysis, task planning, code generation, and self-reflection.
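The division of labor reads as a sequential pipeline of specialized agents. A minimal sketch, where the agent prompts and the revision heuristic are assumptions:

```python
def coda_pipeline(dataset_summary, llm):
    """Chain the four specialized roles named in the abstract (sketch only)."""
    metadata = llm(f"Analyze this dataset's columns and types: {dataset_summary}")
    plan = llm(f"Plan visualizations suited to: {metadata}")
    code = llm(f"Write plotting code for this plan: {plan}")
    critique = llm(f"Self-reflect: does this code satisfy the plan?\n{code}\n{plan}")
    if "revise" in critique.lower():
        code = llm(f"Revise the code per this critique: {critique}\n{code}")
    return code
```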
arXiv Detail & Related papers (2025-10-03T17:30:16Z)
- ToolGrad: Efficient Tool-use Dataset Generation with Textual "Gradients" [53.7887350405379]
Prior work synthesizes tool-use LLM datasets by first generating a user query, followed by complex tool-use annotations like DFS. We introduce ToolGrad, an agentic framework that inverts this paradigm. ToolGrad first constructs valid tool-use chains through an iterative process guided by textual "gradients". This "answer-first" approach led to ToolGrad-5k, a dataset generated with more complex tool use, lower cost, and a 100% pass rate.
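The inversion is the interesting part: grow a valid tool-use chain first, steer it with textual "gradient" feedback, and only then write the query it answers. A hypothetical sketch (the prompts and the execute contract are invented):

```python
def toolgrad(llm, execute, steps=4):
    """Answer-first synthesis: build the chain, then derive the query (sketch)."""
    chain, feedback = [], ""
    for _ in range(steps):
        call = llm(f"Given feedback {feedback!r}, propose the next tool call "
                   f"extending: {chain}")
        result = execute(call)             # returns None on invalid calls
        if result is not None:
            chain.append((call, result))   # only valid, executed calls survive
            feedback = ""
        else:
            # Textual "gradient": a critique that steers the next proposal.
            feedback = llm(f"Critique the failed call {call!r} and suggest a fix.")
    # The user query is generated last, so it is answerable by construction.
    query = llm(f"Write a user query that this chain answers exactly: {chain}")
    return query, chain
```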
arXiv Detail & Related papers (2025-08-06T05:04:00Z)
- Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning [63.31585771716123]
Large language models (LLMs) have shown remarkable reasoning capabilities via large-scale reinforcement learning (RL). We introduce Tool-Star, an RL-based framework designed to empower LLMs to autonomously invoke multiple external tools during stepwise reasoning. Tool-Star integrates six types of tools and incorporates systematic designs in both data synthesis and training.
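Autonomous tool invocation during stepwise reasoning is commonly realized by letting the model emit tagged calls that a runtime executes and feeds back. The tag format and tool registry below are generic assumptions, not Tool-Star's actual protocol.

```python
import re

# Demo registry; a real system would register six tool types per the abstract.
TOOLS = {"calculator": lambda expr: str(eval(expr, {"__builtins__": {}}))}

def run_stepwise(llm, prompt, max_steps=6):
    """Interleave generation with tool execution until a final answer appears."""
    context = prompt
    for _ in range(max_steps):
        step = llm(context)
        call = re.search(r"<tool>(\w+):(.*?)</tool>", step, re.S)
        if call is None:
            return step                    # no tool tag: treat as final answer
        name, arg = call.group(1), call.group(2).strip()
        result = TOOLS[name](arg)          # execute the requested tool
        context += step + f"\n<result>{result}</result>\n"
    return context
```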
arXiv Detail & Related papers (2025-05-22T09:00:19Z)
- Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute (TTC) scaling framework that leverages increased inference-time compute instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. We demonstrate that our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
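External TTC is typically a sample-then-select scheme: draw N candidate solutions and keep the one a verifier scores highest. A generic sketch, with generate and verify as caller-supplied assumptions:

```python
def best_of_n(generate, verify, issue, n=8):
    """External test-time compute: spend inference on n samples, pick the best."""
    candidates = [generate(issue) for _ in range(n)]
    # verify returns a score, e.g. the fraction of regression tests that pass.
    return max(candidates, key=verify)
```

Internal TTC, by contrast, lengthens the reasoning of a single sample rather than multiplying samples.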
arXiv Detail & Related papers (2025-03-31T07:31:32Z)
- Large Language Models as Realistic Microservice Trace Generators [48.730974361862366]
This paper proposes a first-of-its-kind approach that relies on training a large language model (LLM) to generate synthetic workload traces. We show that TraceLLM produces diverse, realistic traces under varied conditions, outperforming existing approaches in both accuracy and validity. TraceLLM adapts to downstream trace-related tasks, such as predicting key trace features and infilling missing data.
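Training an LLM on traces presupposes a textual serialization. One plausible, invented encoding flattens caller-callee spans into one line each, which also makes infilling a natural masked-span task; the span schema is an assumption.

```python
def serialize_trace(spans):
    """Flatten microservice spans into one line per call for LLM training.

    spans: list of dicts with caller, callee, and latency_ms keys (assumed schema).
    """
    return "\n".join(f"{s['caller']} -> {s['callee']} [{s['latency_ms']}ms]"
                     for s in spans)

trace = [
    {"caller": "frontend", "callee": "cart",    "latency_ms": 12},
    {"caller": "cart",     "callee": "redis",   "latency_ms": 3},
    {"caller": "frontend", "callee": "catalog", "latency_ms": 8},
]
print(serialize_trace(trace))
# Infilling: mask one line (e.g. the redis hop) and ask the model to restore it.
```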
arXiv Detail & Related papers (2024-12-16T12:48:04Z)