StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models
- URL: http://arxiv.org/abs/2403.07714v4
- Date: Wed, 19 Jun 2024 11:59:08 GMT
- Title: StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models
- Authors: Zhicheng Guo, Sijie Cheng, Hao Wang, Shihao Liang, Yujia Qin, Peng Li, Zhiyuan Liu, Maosong Sun, Yang Liu
- Abstract summary: We introduce StableToolBench, a benchmark evolving from ToolBench.
The virtual API server contains a caching system and API simulators, which complement each other to mitigate changes in API status.
The stable evaluation system defines solvable pass and win rates, using GPT-4 as the automatic evaluator to eliminate randomness during evaluation.
- Score: 74.88844320554284
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have witnessed remarkable advancements in recent years, prompting the exploration of tool learning, which integrates LLMs with external tools to address diverse real-world challenges. Assessing the capability of LLMs to utilise tools necessitates large-scale and stable benchmarks. However, previous works relied either on hand-crafted online tools of limited scale or on large-scale real online APIs that suffer from unstable API status. To address this problem, we introduce StableToolBench, a benchmark evolving from ToolBench that proposes a virtual API server and a stable evaluation system. The virtual API server contains a caching system and API simulators, which complement each other to mitigate changes in API status. Meanwhile, the stable evaluation system defines solvable pass and win rates, using GPT-4 as the automatic evaluator to eliminate randomness during evaluation. Experimental results demonstrate the stability of StableToolBench, and we further discuss the effectiveness of the API simulators, the caching system, and the evaluator.
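To make the virtual API server concrete, here is a minimal sketch of the cache-then-simulate fallback described in the abstract: a tool call is first answered from a cache of previously observed responses, then from the real API, and only if that fails from an LLM-backed simulator. The class and function names (`VirtualAPIServer`, `call_real_api`, `simulate_with_llm`) are illustrative assumptions, not StableToolBench's actual interfaces.

```python
import hashlib
import json


class VirtualAPIServer:
    """Hypothetical sketch: cache real responses, simulate when the API is unavailable."""

    def __init__(self, call_real_api, simulate_with_llm):
        self.cache = {}                              # previously observed responses
        self.call_real_api = call_real_api           # callable hitting the live API (assumed)
        self.simulate_with_llm = simulate_with_llm   # LLM-based simulator fallback (assumed)

    def _key(self, api_name, arguments):
        # Deterministic key so identical calls hit the same cache entry across runs.
        payload = json.dumps({"api": api_name, "args": arguments}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def call(self, api_name, arguments):
        key = self._key(api_name, arguments)
        if key in self.cache:                        # stable replay of a stored response
            return self.cache[key]
        try:
            response = self.call_real_api(api_name, arguments)
        except Exception:                            # API offline, deprecated, or changed
            response = self.simulate_with_llm(api_name, arguments)
        self.cache[key] = response                   # keep behaviour stable across runs
        return response
```

The design point is that the cache and the simulator are complementary: the cache preserves real behaviour observed while an API was healthy, and the simulator fills in when no cached response exists and the live API cannot be reached.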
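Similarly, a hedged sketch of the solvable pass rate idea: only tasks judged solvable are scored, a GPT-4 judge decides whether each answer solves its task, and multiple judge runs are averaged to damp evaluator randomness. Here `is_solvable` and `judge_solved` are stand-ins for the paper's GPT-4-based evaluator prompts, not its actual code.

```python
from statistics import mean


def solvable_pass_rate(tasks, answers, is_solvable, judge_solved, n_judgements=3):
    """Average pass rate over solvable tasks; judge calls are repeated to reduce variance."""
    scores = []
    for task, answer in zip(tasks, answers):
        if not is_solvable(task):                 # unsolvable tasks are excluded from scoring
            continue
        votes = [judge_solved(task, answer) for _ in range(n_judgements)]
        scores.append(mean(1.0 if v else 0.0 for v in votes))
    return mean(scores) if scores else 0.0
```

A win rate could then be estimated analogously by asking the judge to compare two models' answers on the same set of solvable tasks.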
Related papers
- MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models [66.64809260956312]
We propose a multi-granularity tool-use benchmark for large language models called MTU-Bench.
Our MTU-Bench is collected by transforming existing high-quality datasets to simulate real-world tool usage scenarios.
Comprehensive experimental results demonstrate the effectiveness of our MTU-Bench.
arXiv Detail & Related papers (2024-10-15T15:46:17Z) - Learning Evolving Tools for Large Language Models [44.25796648300785]
We propose ToolEVO to enhance the adaptive and reflective capabilities of large language models (LLMs) against tool variability.
By leveraging Monte Carlo Tree Search, ToolEVO facilitates active exploration and interaction of LLMs within dynamic environments.
We also introduce ToolQA-D, a benchmark specifically designed to evaluate the impact of tool variability.
arXiv Detail & Related papers (2024-10-09T07:14:45Z) - SEAL: Suite for Evaluating API-use of LLMs [1.2528321519119252]
SEAL is an end-to-end testbed designed to evaluate large language models in real-world API usage.
It standardizes existing benchmarks, integrates an agent system for testing API retrieval and planning, and addresses the instability of real-time APIs.
arXiv Detail & Related papers (2024-09-23T20:16:49Z) - ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities [30.030101957186595]
ToolSandbox is an evaluation framework for large language models (LLMs).
ToolSandbox includes stateful tool execution, implicit state dependencies between tools, and a built-in user simulator supporting on-policy conversational evaluation.
We show that open-source and proprietary models have a significant performance gap, and that complex tasks such as State Dependency, Canonicalization, and Insufficient Information defined in ToolSandbox challenge even the most capable SOTA LLMs.
arXiv Detail & Related papers (2024-08-08T05:45:42Z) - Chain of Tools: Large Language Model is an Automatic Multi-tool Learner [54.992464510992605]
Automatic Tool Chain (ATC) is a framework that enables large language models (LLMs) to act as multi-tool users.
To scale up the scope of the tools, we next propose a black-box probing method.
For a comprehensive evaluation, we build a challenging benchmark named ToolFlow.
arXiv Detail & Related papers (2024-05-26T11:40:58Z) - AnyTool: Self-Reflective, Hierarchical Agents for Large-Scale API Calls [30.792186243538037]
We introduce AnyTool, a large language model agent designed to revolutionize the utilization of a vast array of tools in addressing user queries.
We utilize over 16,000 APIs from Rapid API, operating under the assumption that a subset of these APIs could potentially resolve the queries.
AnyTool primarily incorporates three elements: an API retriever with a hierarchical structure, a solver aimed at resolving user queries using a selected set of API candidates, and a self-reflection mechanism.
arXiv Detail & Related papers (2024-02-06T18:59:57Z) - Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios [93.68764280953624]
UltraTool is a novel benchmark designed to improve and evaluate Large Language Models' ability in tool utilization.
It emphasizes real-world complexities, demanding accurate, multi-step planning for effective problem-solving.
A key feature of UltraTool is its independent evaluation of planning with natural language, which happens before tool usage.
arXiv Detail & Related papers (2024-01-30T16:52:56Z) - ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs [104.37772295581088]
Open-source large language models (LLMs), e.g., LLaMA, remain significantly limited in tool-use capabilities.
We introduce ToolLLM, a general tool-use framework encompassing data construction, model training, and evaluation.
We first present ToolBench, an instruction-tuning dataset for tool use, which is constructed automatically using ChatGPT.
arXiv Detail & Related papers (2023-07-31T15:56:53Z) - SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
Basic cross-platform tensor frameworks and script language engines alone do not supply the procedures and pipelines needed to deploy machine learning capabilities in real production-grade systems.
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor frameworks and script language engines.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.