StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models
- URL: http://arxiv.org/abs/2403.07714v4
- Date: Wed, 19 Jun 2024 11:59:08 GMT
- Title: StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models
- Authors: Zhicheng Guo, Sijie Cheng, Hao Wang, Shihao Liang, Yujia Qin, Peng Li, Zhiyuan Liu, Maosong Sun, Yang Liu
- Abstract summary: We introduce StableToolBench, a benchmark evolving from ToolBench.
The virtual API server contains a caching system and API simulators, which complement each other to mitigate changes in API status.
The stable evaluation system defines solvable pass and win rates, using GPT-4 as the automatic evaluator to eliminate randomness during evaluation.
- Score: 74.88844320554284
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have witnessed remarkable advancements in recent years, prompting the exploration of tool learning, which integrates LLMs with external tools to address diverse real-world challenges. Assessing the capability of LLMs to utilise tools necessitates large-scale and stable benchmarks. However, previous works relied either on hand-crafted online tools of limited scale or on large-scale real online APIs that suffer from unstable API status. To address this problem, we introduce StableToolBench, a benchmark evolving from ToolBench that proposes a virtual API server and a stable evaluation system. The virtual API server contains a caching system and API simulators, which complement each other to mitigate changes in API status. Meanwhile, the stable evaluation system defines solvable pass and win rates, using GPT-4 as the automatic evaluator to eliminate randomness during evaluation. Experimental results demonstrate the stability of StableToolBench, and we further discuss the effectiveness of the API simulators, the caching system, and the evaluator.
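To make the virtual API server concrete, here is a minimal sketch of the cache-then-simulate fallback described in the abstract: a tool call is first answered from a cache of previously observed responses, then from the real API, and only if that fails from an LLM-backed simulator. The class and function names (`VirtualAPIServer`, `call_real_api`, `simulate_with_llm`) are illustrative assumptions, not StableToolBench's actual interfaces.

```python
import hashlib
import json


class VirtualAPIServer:
    """Hypothetical sketch: cache real responses, simulate when the API is unavailable."""

    def __init__(self, call_real_api, simulate_with_llm):
        self.cache = {}                              # previously observed responses
        self.call_real_api = call_real_api           # callable hitting the live API (assumed)
        self.simulate_with_llm = simulate_with_llm   # LLM-based simulator fallback (assumed)

    def _key(self, api_name, arguments):
        # Deterministic key so identical calls hit the same cache entry across runs.
        payload = json.dumps({"api": api_name, "args": arguments}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def call(self, api_name, arguments):
        key = self._key(api_name, arguments)
        if key in self.cache:                        # stable replay of a stored response
            return self.cache[key]
        try:
            response = self.call_real_api(api_name, arguments)
        except Exception:                            # API offline, deprecated, or changed
            response = self.simulate_with_llm(api_name, arguments)
        self.cache[key] = response                   # keep behaviour stable across runs
        return response
```

The design point is that the cache and the simulator are complementary: the cache preserves real behaviour observed while an API was healthy, and the simulator fills in when no cached response exists and the live API cannot be reached.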
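Similarly, a hedged sketch of the solvable pass rate idea: only tasks judged solvable are scored, a GPT-4 judge decides whether each answer solves its task, and multiple judge runs are averaged to damp evaluator randomness. Here `is_solvable` and `judge_solved` are stand-ins for the paper's GPT-4-based evaluator prompts, not its actual code.

```python
from statistics import mean


def solvable_pass_rate(tasks, answers, is_solvable, judge_solved, n_judgements=3):
    """Average pass rate over solvable tasks; judge calls are repeated to reduce variance."""
    scores = []
    for task, answer in zip(tasks, answers):
        if not is_solvable(task):                 # unsolvable tasks are excluded from scoring
            continue
        votes = [judge_solved(task, answer) for _ in range(n_judgements)]
        scores.append(mean(1.0 if v else 0.0 for v in votes))
    return mean(scores) if scores else 0.0
```

A win rate could then be estimated analogously by asking the judge to compare two models' answers on the same set of solvable tasks.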
Related papers
- MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models [66.64809260956312]
We propose a multi-granularity tool-use benchmark for large language models called MTU-Bench.
Our MTU-Bench is collected by transforming existing high-quality datasets to simulate real-world tool usage scenarios.
Comprehensive experimental results demonstrate the effectiveness of our MTU-Bench.
arXiv Detail & Related papers (2024-10-15T15:46:17Z) - Learning Evolving Tools for Large Language Models [44.25796648300785]
We propose ToolEVO to enhance the adaptive and reflective capabilities of large language models (LLMs) against tool variability.
By leveraging Monte Carlo Tree Search, ToolEVO facilitates active exploration and interaction of LLMs within dynamic environments.
We also introduce ToolQA-D, a benchmark specifically designed to evaluate the impact of tool variability.
arXiv Detail & Related papers (2024-10-09T07:14:45Z) - SEAL: Suite for Evaluating API-use of LLMs [1.2528321519119252]
SEAL is an end-to-end testbed designed to evaluate large language models in real-world API usage.
It standardizes existing benchmarks, integrates an agent system for testing API retrieval and planning, and addresses the instability of real-time APIs.
arXiv Detail & Related papers (2024-09-23T20:16:49Z) - ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities [30.030101957186595]
ToolSandbox is an evaluation framework for large language models (LLMs).
ToolSandbox includes stateful tool execution, implicit state dependencies between tools, and a built-in user simulator supporting on-policy conversational evaluation.
We show that open-source and proprietary models have a significant performance gap, and that complex tasks such as State Dependency, Canonicalization, and Insufficient Information defined in ToolSandbox challenge even the most capable SOTA LLMs.
arXiv Detail & Related papers (2024-08-08T05:45:42Z) - Chain of Tools: Large Language Model is an Automatic Multi-tool Learner [54.992464510992605]
Automatic Tool Chain (ATC) is a framework that enables large language models (LLMs) to act as multi-tool users.
To scale up the scope of the tools, we next propose a black-box probing method.
For a comprehensive evaluation, we build a challenging benchmark named ToolFlow.
arXiv Detail & Related papers (2024-05-26T11:40:58Z) - AnyTool: Self-Reflective, Hierarchical Agents for Large-Scale API Calls [30.792186243538037]
We introduce AnyTool, a large language model agent designed to revolutionize the utilization of a vast array of tools in addressing user queries.
We utilize over 16,000 APIs from Rapid API, operating under the assumption that a subset of these APIs could potentially resolve the queries.
AnyTool primarily incorporates three elements: an API retriever with a hierarchical structure, a solver aimed at resolving user queries using a selected set of API candidates, and a self-reflection mechanism.
arXiv Detail & Related papers (2024-02-06T18:59:57Z) - Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios [93.68764280953624]
UltraTool is a novel benchmark designed to improve and evaluate Large Language Models' ability in tool utilization.
It emphasizes real-world complexities, demanding accurate, multi-step planning for effective problem-solving.
A key feature of UltraTool is its independent evaluation of planning with natural language, which happens before tool usage.
arXiv Detail & Related papers (2024-01-30T16:52:56Z) - ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs [104.37772295581088]
Open-source large language models (LLMs), e.g., LLaMA, remain significantly limited in tool-use capabilities.
We introduce ToolLLM, a general tool-use framework encompassing data construction, model training, and evaluation.
We first present ToolBench, an instruction-tuning dataset for tool use, which is constructed automatically using ChatGPT.
arXiv Detail & Related papers (2023-07-31T15:56:53Z) - SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
Basic cross-platform tensor frameworks and script language engines alone do not supply the procedures and pipelines needed to deploy machine learning capabilities in real production-grade systems.
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor frameworks and script language engines.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.