LoopTool: Closing the Data-Training Loop for Robust LLM Tool Calls
- URL: http://arxiv.org/abs/2511.09148v2
- Date: Tue, 18 Nov 2025 07:03:59 GMT
- Title: LoopTool: Closing the Data-Training Loop for Robust LLM Tool Calls
- Authors: Kangning Zhang, Wenxiang Jiao, Kounianhua Du, Yuan Lu, Weiwen Liu, Weinan Zhang, Yong Yu
- Abstract summary: LoopTool is a fully automated, model-aware data evolution framework. It iteratively refines both the data and the model through three synergistic modules. Experiments show that our 8B model trained with LoopTool significantly surpasses its 32B data generator.
- Score: 46.34510189812439
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Augmenting Large Language Models (LLMs) with external tools enables them to execute complex, multi-step tasks. However, tool learning is hampered by static synthetic data pipelines in which data generation and model training are executed as two separate, non-interactive processes. This approach fails to adaptively focus on a model's specific weaknesses and allows noisy labels to persist, degrading training efficiency. We introduce LoopTool, a fully automated, model-aware data evolution framework that closes this loop by tightly integrating data synthesis and model training. LoopTool iteratively refines both the data and the model through three synergistic modules: (1) Greedy Capability Probing (GCP) diagnoses the model's mastered and failed capabilities; (2) Judgement-Guided Label Verification (JGLV) uses an open-source judge model to find and correct annotation errors, progressively purifying the dataset; and (3) Error-Driven Data Expansion (EDDE) generates new, challenging samples based on identified failures. This closed-loop process operates within a cost-effective, open-source ecosystem, eliminating dependence on expensive closed-source APIs. Experiments show that our 8B model trained with LoopTool significantly surpasses its 32B data generator and achieves new state-of-the-art results on the BFCL-v3 and ACEBench benchmarks for its scale. Our work demonstrates that closed-loop, self-refining data pipelines can dramatically enhance the tool-use capabilities of LLMs.
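The abstract describes one iteration of the loop as probe (GCP), verify (JGLV), then expand (EDDE). A minimal sketch of that control flow, assuming stand-in function names and treating the model and judge as simple callables (the paper's actual components evaluate real tool-call trajectories), might look like:

```python
from dataclasses import dataclass

@dataclass
class Sample:
    prompt: str
    label: str
    verified: bool = False

def greedy_capability_probe(model, data):
    # GCP (sketch): run the model on each sample and split the set into
    # mastered vs. failed capabilities.
    passed, failed = [], []
    for s in data:
        (passed if model(s) else failed).append(s)
    return passed, failed

def judge_verify(samples):
    # JGLV (sketch): an open-source judge re-checks annotations; here we
    # only mark samples as verified, a placeholder for label correction.
    for s in samples:
        s.verified = True
    return samples

def error_driven_expand(failed, k=2):
    # EDDE (sketch): synthesize k new, harder variants per failed sample.
    return [Sample(f"{s.prompt} (variant {i})", s.label)
            for s in failed for i in range(k)]

def looptool_round(model, data):
    # One closed-loop iteration: probe -> verify -> expand.
    # (Model training on the refreshed dataset would follow each round.)
    _, failed = greedy_capability_probe(model, data)
    data = judge_verify(data)
    return data + error_driven_expand(failed)
```

For example, a round over two samples where the model fails on one would return the two verified originals plus two synthesized variants of the failure. All names here are illustrative, not the paper's API.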
Related papers
- From Failure to Mastery: Generating Hard Samples for Tool-use Agents [40.331752086107265]
HardGen is an automatic agentic pipeline designed to generate hard tool-use training samples with verifiable reasoning. The advanced tools and hard queries enable the generation of verifiable, complex Chain-of-Thought (CoT) reasoning. Our code, models, and dataset will be open-sourced to facilitate future research.
arXiv Detail & Related papers (2026-01-04T11:56:33Z) - Close the Loop: Synthesizing Infinite Tool-Use Data via Multi-Agent Role-Playing [16.839489120513505]
InfTool orchestrates three collaborative agents to generate diverse, verified trajectories spanning single-turn calls to complex multi-step gated calls. We show that InfTool transforms a base 32B model from 19.8% to 70.9% accuracy (+258%), surpassing models 10x larger and rivaling Claude-Opus.
arXiv Detail & Related papers (2025-12-29T17:12:39Z) - ToolForge: A Data Synthesis Pipeline for Multi-Hop Search without Real-World APIs [40.70833390513187]
We introduce ToolForge, an automated synthesis framework that achieves strong real-world tool-calling performance. ToolForge synthesizes large-scale tool-learning data specifically designed for multi-hop search scenarios. Empirical results show that a model with only 8B parameters, when trained on our synthesized data, outperforms GPT-4o on multiple benchmarks.
arXiv Detail & Related papers (2025-12-18T04:06:26Z) - Procedural Environment Generation for Tool-Use Agents [55.10427063893754]
We introduce RandomWorld, a pipeline for the procedural generation of interactive tools and compositional tool-use data. We show that models tuned via SFT and RL on synthetic RandomWorld data improve on a range of tool-use benchmarks.
arXiv Detail & Related papers (2025-05-21T14:10:06Z) - Towards Robust Universal Information Extraction: Benchmark, Evaluation, and Solution [66.11004226578771]
Existing robust benchmark datasets have two key limitations: they generate only a limited range of perturbations for a single Information Extraction (IE) task. Considering the powerful generation capabilities of Large Language Models (LLMs), we introduce a new benchmark dataset for Robust UIE, called RUIE-Bench. We show that training with only 15% of the data leads to an average 7.5% relative performance improvement across three IE tasks.
arXiv Detail & Related papers (2025-03-05T05:39:29Z) - iTool: Reinforced Fine-Tuning with Dynamic Deficiency Calibration for Advanced Tool Use [56.31110409360567]
Augmenting large language models with external tools is a promising approach to enhance their capabilities. We show that training gains significantly decay as synthetic data increases. We propose an iterative reinforced fine-tuning strategy designed to alleviate this limitation.
arXiv Detail & Related papers (2025-01-15T04:52:34Z) - TL-Training: A Task-Feature-Based Framework for Training Large Language Models in Tool Use [72.32614703504122]
Large language models (LLMs) achieve remarkable advancements by leveraging tools to interact with environments. The standard supervised fine-tuning approach, which relies on large-scale datasets, often overlooks task-specific characteristics in tool use. We propose TL-Training, a task-feature-based framework that mitigates the effects of suboptimal training data.
arXiv Detail & Related papers (2024-12-20T02:21:36Z) - Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z) - ToolACE: Winning the Points of LLM Function Calling [139.07157814653638]
ToolACE is an automatic agentic pipeline designed to generate accurate, complex, and diverse tool-learning data. We demonstrate that models trained on our synthesized data, even with only 8B parameters, achieve state-of-the-art performance on the Berkeley Function-Calling Leaderboard.
arXiv Detail & Related papers (2024-09-02T03:19:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.