API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs
- URL: http://arxiv.org/abs/2304.08244v2
- Date: Wed, 25 Oct 2023 06:54:12 GMT
- Title: API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs
- Authors: Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang
Yu, Zhoujun Li, Fei Huang, Yongbin Li
- Abstract summary: API-Bank is a groundbreaking benchmark for tool-augmented Large Language Models.
We develop a run evaluation system consisting of 73 API tools.
We construct a comprehensive training set containing 1,888 tool-use dialogues from 2,138 APIs spanning 1,000 distinct domains.
- Score: 84.45284695156771
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent research has demonstrated that Large Language Models (LLMs) can
enhance their capabilities by utilizing external tools. However, three pivotal
questions remain unanswered: (1) How effective are current LLMs in utilizing
tools? (2) How can we enhance LLMs' ability to utilize tools? (3) What
obstacles need to be overcome to leverage tools? To address these questions, we
introduce API-Bank, a groundbreaking benchmark, specifically designed for
tool-augmented LLMs. For the first question, we develop a runnable evaluation
system consisting of 73 API tools. We annotate 314 tool-use dialogues with 753
API calls to assess the existing LLMs' capabilities in planning, retrieving,
and calling APIs. For the second question, we construct a comprehensive
training set containing 1,888 tool-use dialogues from 2,138 APIs spanning 1,000
distinct domains. Using this dataset, we train Lynx, a tool-augmented LLM
initialized from Alpaca. Experimental results demonstrate that GPT-3.5 exhibits
improved tool utilization compared to GPT-3, while GPT-4 excels in planning.
However, there is still significant potential for further improvement.
Moreover, Lynx surpasses Alpaca's tool utilization performance by more than 26
pts and approaches the effectiveness of GPT-3.5. Through error analysis, we
highlight the key challenges for future research in this field to answer the
third question.
Related papers
- Chain of Tools: Large Language Model is an Automatic Multi-tool Learner [54.992464510992605]
Automatic Tool Chain (ATC) is a framework that enables the large language models (LLMs) to act as a multi-tool user.
To scale up the scope of the tools, we next propose a black-box probing method.
For a comprehensive evaluation, we build a challenging benchmark named ToolFlow.
arXiv Detail & Related papers (2024-05-26T11:40:58Z) - LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error [54.954211216847135]
Existing large language models (LLMs) only reach a correctness rate in the range of 30% to 60%.
We propose a biologically inspired method for tool-augmented LLMs, simulated trial and error (STE)
STE orchestrates three key mechanisms for successful tool use behaviors in the biological system: trial and error, imagination, and memory.
arXiv Detail & Related papers (2024-03-07T18:50:51Z) - API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs [28.840207102132286]
We focus on the task of identifying, curating, and transforming existing datasets.
We introduce API-BLEND, a large corpora for training and systematic testing of tool-augmented LLMs.
We demonstrate the utility of the API-BLEND dataset for both training and benchmarking purposes.
arXiv Detail & Related papers (2024-02-23T18:30:49Z) - AnyTool: Self-Reflective, Hierarchical Agents for Large-Scale API Calls [30.792186243538037]
We introduce AnyTool, a large language model agent designed to revolutionize the utilization of a vast array of tools in addressing user queries.
We utilize over 16,000 APIs from Rapid API, operating under the assumption that a subset of these APIs could potentially resolve the queries.
AnyTool primarily incorporates three elements: an API retriever with a hierarchical structure, a solver aimed at resolving user queries using a selected set of API candidates, and a self-reflection mechanism.
arXiv Detail & Related papers (2024-02-06T18:59:57Z) - Efficient Tool Use with Chain-of-Abstraction Reasoning [65.18096363216574]
Large language models (LLMs) need to ground their reasoning to real-world knowledge.
There remains challenges for fine-tuning LLM agents to invoke tools in multi-step reasoning problems.
We propose a new method for LLMs to better leverage tools in multi-step reasoning.
arXiv Detail & Related papers (2024-01-30T21:53:30Z) - CRAFT: Customizing LLMs by Creating and Retrieving from Specialized
Toolsets [75.64181719386497]
We present CRAFT, a tool creation and retrieval framework for large language models (LLMs)
It creates toolsets specifically curated for the tasks and equips LLMs with a component that retrieves tools from these sets to enhance their capability to solve complex tasks.
Our method is designed to be flexible and offers a plug-and-play approach to adapt off-the-shelf LLMs to unseen domains and modalities, without any finetuning.
arXiv Detail & Related papers (2023-09-29T17:40:26Z) - ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world
APIs [104.37772295581088]
Open-source large language models (LLMs), e.g., LLaMA, remain significantly limited in tool-use capabilities.
We introduce ToolLLM, a general tool-usetuning encompassing data construction, model training, and evaluation.
We first present ToolBench, an instruction-tuning framework for tool use, which is constructed automatically using ChatGPT.
arXiv Detail & Related papers (2023-07-31T15:56:53Z) - ToolQA: A Dataset for LLM Question Answering with External Tools [14.408707186450899]
Large Language Models (LLMs) have demonstrated impressive performance in various NLP tasks.
They still suffer from challenges such as hallucination and weak numerical reasoning.
To overcome these challenges, external tools can be used to enhance LLMs' question-answering abilities.
arXiv Detail & Related papers (2023-06-23T05:43:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.