Related papers: ToolForge: A Data Synthesis Pipeline for Multi-Hop Search without Real-World APIs

URL: http://arxiv.org/abs/2512.16149v1
Date: Thu, 18 Dec 2025 04:06:26 GMT
Title: ToolForge: A Data Synthesis Pipeline for Multi-Hop Search without Real-World APIs
Authors: Hao Chen, Zhexin Hu, Jiajun Chai, Haocheng Yang, Hang He, Xiaohan Wang, Wei Lin, Luhang Wang, Guojun Yin, Zhuofeng Zhao
Abstract summary: We introduce ToolForge, an automated synthesis framework that achieves strong real-world tool-calling performance. ToolForge synthesizes large-scale tool-learning data specifically designed for multi-hop search scenarios. Empirical results show that a model with only 8B parameters, when trained on our synthesized data, outperforms GPT-4o on multiple benchmarks.
Score: 40.70833390513187
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Training LLMs to invoke tools and leverage retrieved information necessitates high-quality, diverse data. However, existing pipelines for synthetic data generation often rely on tens of thousands of real API calls to enhance generalization, incurring prohibitive costs while lacking multi-hop reasoning and self-reflection. To address these limitations, we introduce ToolForge, an automated synthesis framework that achieves strong real-world tool-calling performance by constructing only a small number of virtual tools, eliminating the need for real API calls. ToolForge leverages a (question, golden context, answer) triple to synthesize large-scale tool-learning data specifically designed for multi-hop search scenarios, further enriching the generated data through multi-hop reasoning and self-reflection mechanisms. To ensure data fidelity, we employ a Multi-Layer Validation Framework that integrates both rule-based and model-based assessments. Empirical results show that a model with only 8B parameters, when trained on our synthesized data, outperforms GPT-4o on multiple benchmarks. Our code and dataset are publicly available at https://github.com/Buycar-arb/ToolForge .

Related papers

FABRIC: Framework for Agent-Based Realistic Intelligence Creation [3.940391073007047]
Large language models (LLMs) are increasingly deployed as agents, expected to decompose goals, invoke tools, and verify results in dynamic environments. We present a unified framework for synthesizing agentic data using only LLMs, without any human-in-the-loop supervision.
arXiv Detail & Related papers (2025-10-20T18:20:22Z)
RLFactory: A Plug-and-Play Reinforcement Learning Post-Training Framework for LLM Multi-Turn Tool-Use [50.52940111891476]
Large language models excel at basic reasoning but struggle with tasks that require interaction with external tools. We present RLFactory, a plug-and-play reinforcement learning framework for multi-round tool use.
arXiv Detail & Related papers (2025-08-31T16:47:31Z)
Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky [0.5735035463793009]
Large language models (LLMs) are increasingly tasked with invoking enterprise APIs, yet they routinely falter when near-duplicate tools vie for the same user intent. We introduce DiaFORGE, a disambiguation-centric, three-stage pipeline that synthesizes persona-driven, multi-turn dialogues. On our benchmark DiaBENCH, models trained with DiaFORGE raise tool-invocation success by 27 pp over GPT-4o and by 49 pp over Claude-3.5-Sonnet, both under optimized prompting.
arXiv Detail & Related papers (2025-07-04T06:49:02Z)
Rethinking Stateful Tool Use in Multi-Turn Dialogues: Benchmarks and Challenges [30.68589269821412]
Existing benchmarks that assess Language Models (LMs) as Language Agents (LAs) for tool use primarily focus on single-turn interactions. We propose DialogTool, a multi-turn dialogue dataset with stateful tool interactions considering the whole life cycle of tool use.
arXiv Detail & Related papers (2025-05-19T16:36:13Z)
RouteNator: A Router-Based Multi-Modal Architecture for Generating Synthetic Training Data for Function Calling LLMs [3.41612427812159]
In digital content creation tools, users express their needs through natural language queries that must be mapped to API calls. Existing approaches to synthetic data generation fail to replicate real-world data distributions. We present a novel router-based architecture that generates high-quality synthetic training data.
arXiv Detail & Related papers (2025-05-15T16:53:45Z)
ToolFlow: Boosting LLM Tool-Calling Through Natural and Coherent Dialogue Synthesis [80.34000499166648]
We propose a Graph-based Sampling strategy to sample more relevant tool combinations, and a Planned-generation strategy to create plans that guide the synthesis of coherent dialogues. We apply SFT on LLaMA-3.1-8B using 8,000 synthetic dialogues generated with ToolFlow. Results show that the model achieves tool-calling performance comparable to or even surpassing GPT-4, while maintaining strong general capabilities.
arXiv Detail & Related papers (2024-10-24T05:45:04Z)
What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices [91.71951459594074]
Large language models (LLMs) with extended context windows have significantly improved tasks such as information extraction, question answering, and complex planning scenarios. Existing methods typically use the Self-Instruct framework to generate instruction-tuning data that improves long-context capability. We propose the Multi-agent Interactive Multi-hop Generation framework, incorporating a Quality Verification Agent, a Single-hop Question Generation Agent, a Multiple Question Sampling Strategy, and a Multi-hop Question Merger Agent. Our findings show that our synthetic high-quality long-context instruction data significantly enhances model performance, even surpassing models trained on larger amounts of human
arXiv Detail & Related papers (2024-09-03T13:30:00Z)
ToolACE: Winning the Points of LLM Function Calling [139.07157814653638]
ToolACE is an automatic agentic pipeline designed to generate accurate, complex, and diverse tool-learning data. We demonstrate that models trained on our synthesized data, even with only 8B parameters, achieve state-of-the-art performance on the Berkeley Function-Calling Leaderboard.
arXiv Detail & Related papers (2024-09-02T03:19:56Z)
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs [104.37772295581088]
Open-source large language models (LLMs), e.g., LLaMA, remain significantly limited in tool-use capabilities. We introduce ToolLLM, a general tool-use framework encompassing data construction, model training, and evaluation. We first present ToolBench, an instruction-tuning dataset for tool use, which is constructed automatically using ChatGPT.
arXiv Detail & Related papers (2023-07-31T15:56:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.