ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models in Multi-Hop Tool Use
- URL: http://arxiv.org/abs/2501.02506v2
- Date: Tue, 07 Jan 2025 09:13:35 GMT
- Title: ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models in Multi-Hop Tool Use
- Authors: Junjie Ye, Zhengyin Du, Xuesong Yao, Weijian Lin, Yufei Xu, Zehui Chen, Zaiyuan Wang, Sining Zhu, Zhiheng Xi, Siyu Yuan, Tao Gui, Qi Zhang, Xuanjing Huang, Jiecao Chen,
- Abstract summary: We present ToolHop, a dataset comprising 995 user queries and 3,912 associated tools.
ToolHop ensures diverse queries, meaningful interdependencies, locally executable tools, detailed feedback, and verifiable answers.
We evaluate 14 LLMs across five model families, uncovering significant challenges in handling multi-hop tool-use scenarios.
- Score: 51.43211624452462
- License:
- Abstract: Effective evaluation of multi-hop tool use is critical for analyzing the understanding, reasoning, and function-calling capabilities of large language models (LLMs). However, progress has been hindered by a lack of reliable evaluation datasets. To address this, we present ToolHop, a dataset comprising 995 user queries and 3,912 associated tools, specifically designed for rigorous evaluation of multi-hop tool use. ToolHop ensures diverse queries, meaningful interdependencies, locally executable tools, detailed feedback, and verifiable answers through a novel query-driven data construction approach that includes tool creation, document refinement, and code generation. We evaluate 14 LLMs across five model families (i.e., LLaMA3.1, Qwen2.5, Gemini1.5, Claude3.5, and GPT), uncovering significant challenges in handling multi-hop tool-use scenarios. The leading model, GPT-4o, achieves an accuracy of 49.04%, underscoring substantial room for improvement. Further analysis reveals variations in tool-use strategies for various families, offering actionable insights to guide the development of more effective approaches. Code and data can be found in https://huggingface.co/datasets/bytedance-research/ToolHop.
Related papers
- Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger [49.81945268343162]
We propose MeCo, an adaptive decision-making strategy for external tool use.
MeCo captures high-level cognitive signals in the representation space, guiding when to invoke tools.
Our experiments show that MeCo accurately detects LLMs' internal cognitive signals and significantly improves tool-use decision-making.
arXiv Detail & Related papers (2025-02-18T15:45:01Z) - Efficient and Scalable Estimation of Tool Representations in Vector Space [34.767193045989515]
We present a framework for generating synthetic data for tool retrieval applications and an efficient data-driven tool retrieval strategy using small encoder models.
We create ToolBank, a new tool retrieval dataset that reflects real human user usages.
With these new methods, we achieve improvements of up to 27.28 in Recall@K on the ToolBench dataset and 30.5 in Recall@K on ToolBank.
arXiv Detail & Related papers (2024-09-02T19:39:24Z) - ToolBeHonest: A Multi-level Hallucination Diagnostic Benchmark for Tool-Augmented Large Language Models [43.895478182631116]
Tool-augmented large language models (LLMs) are rapidly being integrated into real-world applications.
To address this challenge, we introduce a comprehensive diagnostic benchmark, ToolBH.
For breadth, we consider three scenarios based on the characteristics of the toolset: missing necessary tools, potential tools, and limited functionality tools.
The results show the significant challenges presented by the ToolBH benchmark.
arXiv Detail & Related papers (2024-06-28T16:03:30Z) - Enhancing Tool Retrieval with Iterative Feedback from Large Language Models [9.588592185027455]
Large language models (LLMs) can effectively handle a certain amount of tools through in-context learning or fine-tuning.
In real-world scenarios, the number of tools is typically extensive and irregularly updated, emphasizing the necessity for a dedicated tool retrieval component.
We propose to enhance tool retrieval with iterative feedback from the large language model.
arXiv Detail & Related papers (2024-06-25T11:12:01Z) - Towards Completeness-Oriented Tool Retrieval for Large Language Models [60.733557487886635]
Real-world systems often incorporate a wide array of tools, making it impractical to input all tools into Large Language Models.
Existing tool retrieval methods primarily focus on semantic matching between user queries and tool descriptions.
We propose a novel modelagnostic COllaborative Learning-based Tool Retrieval approach, COLT, which captures not only the semantic similarities between user queries and tool descriptions but also takes into account the collaborative information of tools.
arXiv Detail & Related papers (2024-05-25T06:41:23Z) - What Are Tools Anyway? A Survey from the Language Model Perspective [67.18843218893416]
Language models (LMs) are powerful yet mostly for text generation tasks.
We provide a unified definition of tools as external programs used by LMs.
We empirically study the efficiency of various tooling methods.
arXiv Detail & Related papers (2024-03-18T17:20:07Z) - ToolTalk: Evaluating Tool-Usage in a Conversational Setting [6.792842055445584]
This paper introduces ToolTalk, a benchmark consisting of complex user intents requiring multi-step tool usage specified through dialogue.
ToolTalk contains 28 tools grouped into 7 plugins, and includes a complete simulated implementation of each tool.
We evaluate GPT-3.5 and GPT-4 on ToolTalk resulting in success rates of 26% and 50% respectively.
arXiv Detail & Related papers (2023-11-15T23:50:31Z) - Tool-Augmented Reward Modeling [58.381678612409]
We propose a tool-augmented preference modeling approach, named Themis, to address limitations by empowering RMs with access to external environments.
Our study delves into the integration of external tools into RMs, enabling them to interact with diverse external sources.
In human evaluations, RLHF trained with Themis attains an average win rate of 32% when compared to baselines.
arXiv Detail & Related papers (2023-10-02T09:47:40Z) - API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs [84.45284695156771]
API-Bank is a groundbreaking benchmark for tool-augmented Large Language Models.
We develop a run evaluation system consisting of 73 API tools.
We construct a comprehensive training set containing 1,888 tool-use dialogues from 2,138 APIs spanning 1,000 distinct domains.
arXiv Detail & Related papers (2023-04-14T14:05:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.