Related papers: ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models in Multi-Hop Tool Use

ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models in Multi-Hop Tool Use

URL: http://arxiv.org/abs/2501.02506v2
Date: Tue, 07 Jan 2025 09:13:35 GMT
Title: ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models in Multi-Hop Tool Use
Authors: Junjie Ye, Zhengyin Du, Xuesong Yao, Weijian Lin, Yufei Xu, Zehui Chen, Zaiyuan Wang, Sining Zhu, Zhiheng Xi, Siyu Yuan, Tao Gui, Qi Zhang, Xuanjing Huang, Jiecao Chen,
Abstract summary: We present ToolHop, a dataset comprising 995 user queries and 3,912 associated tools.<n>ToolHop ensures diverse queries, meaningful interdependencies, locally executable tools, detailed feedback, and verifiable answers.<n>We evaluate 14 LLMs across five model families, uncovering significant challenges in handling multi-hop tool-use scenarios.
Score: 51.43211624452462
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Effective evaluation of multi-hop tool use is critical for analyzing the understanding, reasoning, and function-calling capabilities of large language models (LLMs). However, progress has been hindered by a lack of reliable evaluation datasets. To address this, we present ToolHop, a dataset comprising 995 user queries and 3,912 associated tools, specifically designed for rigorous evaluation of multi-hop tool use. ToolHop ensures diverse queries, meaningful interdependencies, locally executable tools, detailed feedback, and verifiable answers through a novel query-driven data construction approach that includes tool creation, document refinement, and code generation. We evaluate 14 LLMs across five model families (i.e., LLaMA3.1, Qwen2.5, Gemini1.5, Claude3.5, and GPT), uncovering significant challenges in handling multi-hop tool-use scenarios. The leading model, GPT-4o, achieves an accuracy of 49.04%, underscoring substantial room for improvement. Further analysis reveals variations in tool-use strategies for various families, offering actionable insights to guide the development of more effective approaches. Code and data can be found in https://huggingface.co/datasets/bytedance-research/ToolHop.

Related papers

TRAJECT-Bench:A Trajectory-Aware Benchmark for Evaluating Agentic Tool Use [74.47746287181383]
Large language model (LLM)-based agents increasingly rely on tool use to complete real-world tasks.<n>We introduce TRAJECT-Bench, a trajectory-aware benchmark to comprehensively evaluate LLMs' tool use capability.
arXiv Detail & Related papers (2025-10-06T07:30:25Z)
MassTool: A Multi-Task Search-Based Tool Retrieval Framework for Large Language Models [45.63804847907601]
MassTool is a multi-task search-based framework designed to enhance both query representation and tool retrieval accuracy.<n>It employs a two-tower architecture: a tool usage detection tower that predicts the need for function calls, and a tool retrieval tower that leverages a query-centric graph convolution network (QC-GCN) for effective query-tool matching.<n>By jointly optimizing tool usage detection loss, list-wise retrieval loss, and contrastive regularization loss, MassTool establishes a robust dual-step sequential decision-making pipeline for precise query understanding.
arXiv Detail & Related papers (2025-07-01T07:02:26Z)
CRITICTOOL: Evaluating Self-Critique Capabilities of Large Language Models in Tool-Calling Error Scenarios [30.20881816731553]
The ability of large language models to utilize external tools has enabled them to tackle an increasingly diverse range of tasks.<n>As the tasks become more complex and long-horizon, the intricate tool utilization process may trigger various unexpected errors.<n>How to effectively handle such errors, including identifying, diagnosing, and recovering from them, has emerged as a key research direction for advancing tool learning.
arXiv Detail & Related papers (2025-06-11T17:59:18Z)
FamilyTool: A Multi-hop Personalized Tool Use Benchmark [93.80355496575281]
FamilyTool is a benchmark grounded in a family-based knowledge graph (KG) that simulates personalized, multi-hop tool use scenarios.<n> Experiments reveal significant performance gaps in state-of-the-art Large Language Models (LLMs)<n>FamilyTool serves as a critical resource for evaluating and advancing LLM agents' reasoning, adaptability, and scalability in complex, dynamic environments.
arXiv Detail & Related papers (2025-04-09T10:42:36Z)
Chain-of-Tools: Utilizing Massive Unseen Tools in the CoT Reasoning of Frozen Language Models [8.573278807410507]
Tool learning can further broaden the usage scenarios of large language models (LLMs) We present a new Tool Learning method Chain-of-Tools. It makes full use of the powerful semantic representation capability of frozen LLMs to finish tool calling in CoT reasoning.
arXiv Detail & Related papers (2025-03-21T01:26:12Z)
Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger [49.81945268343162]
We propose MeCo, an adaptive decision-making strategy for external tool use. MeCo captures high-level cognitive signals in the representation space, guiding when to invoke tools. Our experiments show that MeCo accurately detects LLMs' internal cognitive signals and significantly improves tool-use decision-making.
arXiv Detail & Related papers (2025-02-18T15:45:01Z)
Efficient and Scalable Estimation of Tool Representations in Vector Space [34.767193045989515]
We present a framework for generating synthetic data for tool retrieval applications and an efficient data-driven tool retrieval strategy using small encoder models. We create ToolBank, a new tool retrieval dataset that reflects real human user usages. With these new methods, we achieve improvements of up to 27.28 in Recall@K on the ToolBench dataset and 30.5 in Recall@K on ToolBank.
arXiv Detail & Related papers (2024-09-02T19:39:24Z)
ToolBeHonest: A Multi-level Hallucination Diagnostic Benchmark for Tool-Augmented Large Language Models [43.895478182631116]
Tool-augmented large language models (LLMs) are rapidly being integrated into real-world applications. To address this challenge, we introduce a comprehensive diagnostic benchmark, ToolBH. For breadth, we consider three scenarios based on the characteristics of the toolset: missing necessary tools, potential tools, and limited functionality tools. The results show the significant challenges presented by the ToolBH benchmark.
arXiv Detail & Related papers (2024-06-28T16:03:30Z)
Enhancing Tool Retrieval with Iterative Feedback from Large Language Models [9.588592185027455]
Large language models (LLMs) can effectively handle a certain amount of tools through in-context learning or fine-tuning. In real-world scenarios, the number of tools is typically extensive and irregularly updated, emphasizing the necessity for a dedicated tool retrieval component. We propose to enhance tool retrieval with iterative feedback from the large language model.
arXiv Detail & Related papers (2024-06-25T11:12:01Z)
Towards Completeness-Oriented Tool Retrieval for Large Language Models [60.733557487886635]
Real-world systems often incorporate a wide array of tools, making it impractical to input all tools into Large Language Models. Existing tool retrieval methods primarily focus on semantic matching between user queries and tool descriptions. We propose a novel modelagnostic COllaborative Learning-based Tool Retrieval approach, COLT, which captures not only the semantic similarities between user queries and tool descriptions but also takes into account the collaborative information of tools.
arXiv Detail & Related papers (2024-05-25T06:41:23Z)
What Are Tools Anyway? A Survey from the Language Model Perspective [67.18843218893416]
Language models (LMs) are powerful yet mostly for text generation tasks. We provide a unified definition of tools as external programs used by LMs. We empirically study the efficiency of various tooling methods.
arXiv Detail & Related papers (2024-03-18T17:20:07Z)
ToolTalk: Evaluating Tool-Usage in a Conversational Setting [6.792842055445584]
This paper introduces ToolTalk, a benchmark consisting of complex user intents requiring multi-step tool usage specified through dialogue. ToolTalk contains 28 tools grouped into 7 plugins, and includes a complete simulated implementation of each tool. We evaluate GPT-3.5 and GPT-4 on ToolTalk resulting in success rates of 26% and 50% respectively.
arXiv Detail & Related papers (2023-11-15T23:50:31Z)
Tool-Augmented Reward Modeling [58.381678612409]
We propose a tool-augmented preference modeling approach, named Themis, to address limitations by empowering RMs with access to external environments. Our study delves into the integration of external tools into RMs, enabling them to interact with diverse external sources. In human evaluations, RLHF trained with Themis attains an average win rate of 32% when compared to baselines.
arXiv Detail & Related papers (2023-10-02T09:47:40Z)
API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs [84.45284695156771]
API-Bank is a groundbreaking benchmark for tool-augmented Large Language Models. We develop a run evaluation system consisting of 73 API tools. We construct a comprehensive training set containing 1,888 tool-use dialogues from 2,138 APIs spanning 1,000 distinct domains.
arXiv Detail & Related papers (2023-04-14T14:05:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.