Related papers: AnyTool: Self-Reflective, Hierarchical Agents for Large-Scale API Calls

AnyTool: Self-Reflective, Hierarchical Agents for Large-Scale API Calls

URL: http://arxiv.org/abs/2402.04253v1
Date: Tue, 6 Feb 2024 18:59:57 GMT
Title: AnyTool: Self-Reflective, Hierarchical Agents for Large-Scale API Calls
Authors: Yu Du, Fangyun Wei, Hongyang Zhang
Abstract summary: We introduce AnyTool, a large language model agent designed to revolutionize the utilization of a vast array of tools in addressing user queries. We utilize over 16,000 APIs from Rapid API, operating under the assumption that a subset of these APIs could potentially resolve the queries. AnyTool primarily incorporates three elements: an API retriever with a hierarchical structure, a solver aimed at resolving user queries using a selected set of API candidates, and a self-reflection mechanism.
Score: 30.792186243538037
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce AnyTool, a large language model agent designed to revolutionize the utilization of a vast array of tools in addressing user queries. We utilize over 16,000 APIs from Rapid API, operating under the assumption that a subset of these APIs could potentially resolve the queries. AnyTool primarily incorporates three elements: an API retriever with a hierarchical structure, a solver aimed at resolving user queries using a selected set of API candidates, and a self-reflection mechanism, which re-activates AnyTool if the initial solution proves impracticable. AnyTool is powered by the function calling feature of GPT-4, eliminating the need for training external modules. We also revisit the evaluation protocol introduced by previous works and identify a limitation in this protocol that leads to an artificially high pass rate. By revising the evaluation protocol to better reflect practical application scenarios, we introduce an additional benchmark, termed AnyToolBench. Experiments across various datasets demonstrate the superiority of our AnyTool over strong baselines such as ToolLLM and a GPT-4 variant tailored for tool utilization. For instance, AnyTool outperforms ToolLLM by +35.4% in terms of average pass rate on ToolBench. Code will be available at https://github.com/dyabel/AnyTool.

Related papers

Advancing Tool-Augmented Large Language Models via Meta-Verification and Reflection Learning [63.2198957755528]
We propose Tool-MVR, a novel Tool-Augmented LLM that achieves comprehensive System 2 reasoning through two key innovations.<n>Specifically, we first introduce Multi-Agent Meta-Verification (MAMV), a systematic pipeline that rigorously validates APIs, queries, and reasoning trajectories.<n>Second, we propose Exploration-based Reflection Learning (EXPLORE), which enhances tool reflection capabilities by leveraging tool feedback.
arXiv Detail & Related papers (2025-06-05T04:35:49Z)
Advancing and Benchmarking Personalized Tool Invocation for LLMs [66.39214525683425]
We introduce the concept of Personalized Tool Invocation and define two key tasks: Tool Preference and Profile-dependent Query.<n>To tackle these challenges, we propose PTool, a data synthesis framework designed for personalized tool invocation.<n>We construct textbfPTBench, the first benchmark for evaluating personalized tool invocation.
arXiv Detail & Related papers (2025-05-07T02:25:20Z)
ToolFuzz -- Automated Agent Tool Testing [5.174808367448261]
ToolFuzz is designed to discover two types of errors: (1) user queries leading to tool runtime errors and (2) user queries that lead to incorrect agent responses. We show that ToolFuzz identifies 20x more erroneous inputs compared to the prompt-engineering approaches.
arXiv Detail & Related papers (2025-03-06T14:29:52Z)
Graph RAG-Tool Fusion [0.0]
Graph RAG-Tool Fusion is a novel plug-and-play approach to capture all relevant tools (nodes) along with any nested dependencies (edges) within a tool knowledge graph. We present ToolLinkOS, a new tool selection benchmark of 573 fictional tools, spanning over 15 industries. We demonstrate that Graph RAG-Tool Fusion achieves absolute improvements of 71.7% and 22.1% over na"ive RAG on ToolLinkOS and ToolSandbox benchmarks, respectively.
arXiv Detail & Related papers (2025-02-11T03:32:34Z)
ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models in Multi-Hop Tool Use [51.43211624452462]
We present ToolHop, a dataset comprising 995 user queries and 3,912 associated tools. ToolHop ensures diverse queries, meaningful interdependencies, locally executable tools, detailed feedback, and verifiable answers. We evaluate 14 LLMs across five model families, uncovering significant challenges in handling multi-hop tool-use scenarios.
arXiv Detail & Related papers (2025-01-05T11:06:55Z)
Efficient and Scalable Estimation of Tool Representations in Vector Space [34.767193045989515]
We present a framework for generating synthetic data for tool retrieval applications and an efficient data-driven tool retrieval strategy using small encoder models. We create ToolBank, a new tool retrieval dataset that reflects real human user usages. With these new methods, we achieve improvements of up to 27.28 in Recall@K on the ToolBench dataset and 30.5 in Recall@K on ToolBank.
arXiv Detail & Related papers (2024-09-02T19:39:24Z)
Chain of Tools: Large Language Model is an Automatic Multi-tool Learner [54.992464510992605]
Automatic Tool Chain (ATC) is a framework that enables the large language models (LLMs) to act as a multi-tool user. To scale up the scope of the tools, we next propose a black-box probing method. For a comprehensive evaluation, we build a challenging benchmark named ToolFlow.
arXiv Detail & Related papers (2024-05-26T11:40:58Z)
Seal-Tools: Self-Instruct Tool Learning Dataset for Agent Tuning and Detailed Benchmark [8.573278807410507]
This paper presents a new tool learning dataset Seal-Tools. Seal-Tools contains self-instruct API-like tools. It also includes instances which demonstrate the practical application of tools.
arXiv Detail & Related papers (2024-05-14T06:50:19Z)
StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models [74.88844320554284]
We introduce StableToolBench, a benchmark evolving from ToolBench. The virtual API server contains a caching system and API simulators which are complementary to alleviate the change in API status. The stable evaluation system designs solvable pass and win rates using GPT-4 as the automatic evaluator to eliminate the randomness during evaluation.
arXiv Detail & Related papers (2024-03-12T14:57:40Z)
EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction [56.02100384015907]
EasyTool is a framework transforming diverse and lengthy tool documentation into a unified and concise tool instruction. It can significantly reduce token consumption and improve the performance of tool utilization in real-world scenarios.
arXiv Detail & Related papers (2024-01-11T15:45:11Z)
MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use [82.24774504584066]
Large language models (LLMs) have garnered significant attention due to their impressive natural language processing (NLP) capabilities. We introduce MetaTool, a benchmark designed to evaluate whether LLMs have tool usage awareness and can correctly choose tools. We conduct experiments involving eight popular LLMs and find that the majority of them still struggle to effectively select tools.
arXiv Detail & Related papers (2023-10-04T19:39:26Z)
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs [104.37772295581088]
Open-source large language models (LLMs), e.g., LLaMA, remain significantly limited in tool-use capabilities. We introduce ToolLLM, a general tool-usetuning encompassing data construction, model training, and evaluation. We first present ToolBench, an instruction-tuning framework for tool use, which is constructed automatically using ChatGPT.
arXiv Detail & Related papers (2023-07-31T15:56:53Z)
ToolCoder: Teach Code Generation Models to use API search tools [44.370920906850024]
We propose ToolCoder, a novel approach that integrates API search tools with existing models to assist in code generation and API selection. Our experimental results demonstrate that ToolCoder exhibits excellent performance and generalization across five public and private library code generation benchmarks.
arXiv Detail & Related papers (2023-05-06T12:45:28Z)
API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs [84.45284695156771]
API-Bank is a groundbreaking benchmark for tool-augmented Large Language Models. We develop a run evaluation system consisting of 73 API tools. We construct a comprehensive training set containing 1,888 tool-use dialogues from 2,138 APIs spanning 1,000 distinct domains.
arXiv Detail & Related papers (2023-04-14T14:05:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.