API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs
- URL: http://arxiv.org/abs/2304.08244v2
- Date: Wed, 25 Oct 2023 06:54:12 GMT
- Title: API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs
- Authors: Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, Yongbin Li
- Abstract summary: API-Bank is a groundbreaking benchmark for tool-augmented Large Language Models.
We develop a runnable evaluation system consisting of 73 API tools.
We construct a comprehensive training set containing 1,888 tool-use dialogues from 2,138 APIs spanning 1,000 distinct domains.
- Score: 84.45284695156771
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent research has demonstrated that Large Language Models (LLMs) can
enhance their capabilities by utilizing external tools. However, three pivotal
questions remain unanswered: (1) How effective are current LLMs in utilizing
tools? (2) How can we enhance LLMs' ability to utilize tools? (3) What
obstacles need to be overcome to leverage tools? To address these questions, we
introduce API-Bank, a groundbreaking benchmark, specifically designed for
tool-augmented LLMs. For the first question, we develop a runnable evaluation
system consisting of 73 API tools. We annotate 314 tool-use dialogues with 753
API calls to assess the existing LLMs' capabilities in planning, retrieving,
and calling APIs. For the second question, we construct a comprehensive
training set containing 1,888 tool-use dialogues from 2,138 APIs spanning 1,000
distinct domains. Using this dataset, we train Lynx, a tool-augmented LLM
initialized from Alpaca. Experimental results demonstrate that GPT-3.5 exhibits
improved tool utilization compared to GPT-3, while GPT-4 excels in planning.
However, there is still significant potential for further improvement.
Moreover, Lynx surpasses Alpaca's tool utilization performance by more than 26
pts and approaches the effectiveness of GPT-3.5. Through error analysis, we
highlight the key challenges for future research in this field to answer the
third question.
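The abstract describes assessing models against annotated API calls. As a rough illustration only, the sketch below scores predicted calls against reference calls by exact match. The exact-match criterion, the `ApiCall` type, the helper names, and the API names `GetWeather` and `AddAlarm` are all assumptions made for this example; the abstract does not specify the paper's actual metric or API set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApiCall:
    """One annotated or predicted call (hypothetical structure)."""
    name: str
    args: tuple  # sorted (key, value) pairs, so equal calls compare equal


def make_call(name, **kwargs):
    """Build an ApiCall whose arguments are order-independent."""
    return ApiCall(name=name, args=tuple(sorted(kwargs.items())))


def call_accuracy(predicted, reference):
    """Fraction of reference calls matched exactly, position by position.

    Assumption: exact match on API name and arguments; a real benchmark
    may use a different or more lenient criterion.
    """
    if not reference:
        return 0.0
    hits = sum(1 for p, r in zip(predicted, reference) if p == r)
    return hits / len(reference)


# Hypothetical two-call dialogue: the second predicted call has a wrong argument.
reference = [make_call("GetWeather", city="Paris"),
             make_call("AddAlarm", time="07:00")]
predicted = [make_call("GetWeather", city="Paris"),
             make_call("AddAlarm", time="08:00")]
accuracy = call_accuracy(predicted, reference)
```

Here one of the two reference calls is reproduced exactly, so `accuracy` evaluates to 0.5. Missing predictions count as misses because the denominator is the number of reference calls.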
Related papers
- ToolPlanner: A Tool Augmented LLM for Multi Granularity Instructions with Path Planning and Feedback [11.931584529573176]
Given an instruction, tool-augmented LLMs can interact with various external tools in multiple rounds and provide a final answer.
Previous LLMs were trained on overly detailed instructions, which included API names or parameters, while real users would not explicitly mention these API details.
To address these issues, we constructed a training dataset called MGToolBench, which contains statement and category-level instructions to better reflect real-world scenarios.
arXiv Detail & Related papers (2024-09-23T08:58:48Z)
- Efficient and Scalable Estimation of Tool Representations in Vector Space [34.767193045989515]
We present a framework for generating synthetic data for tool retrieval applications and an efficient data-driven tool retrieval strategy using small encoder models.
We create ToolBank, a new tool retrieval dataset that reflects real human user usages.
With these new methods, we achieve improvements of up to 27.28 in Recall@K on the ToolBench dataset and 30.5 in Recall@K on ToolBank.
arXiv Detail & Related papers (2024-09-02T19:39:24Z)
- Chain of Tools: Large Language Model is an Automatic Multi-tool Learner [54.992464510992605]
Automatic Tool Chain (ATC) is a framework that enables large language models (LLMs) to act as multi-tool users.
To scale up the scope of the tools, we next propose a black-box probing method.
For a comprehensive evaluation, we build a challenging benchmark named ToolFlow.
arXiv Detail & Related papers (2024-05-26T11:40:58Z)
- LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error [54.954211216847135]
Existing large language models (LLMs) only reach a tool-use correctness rate in the range of 30% to 60%.
We propose a biologically inspired method for tool-augmented LLMs, simulated trial and error (STE).
STE orchestrates three key mechanisms for successful tool use behaviors in the biological system: trial and error, imagination, and memory.
arXiv Detail & Related papers (2024-03-07T18:50:51Z)
- Efficient Tool Use with Chain-of-Abstraction Reasoning [65.18096363216574]
Large language models (LLMs) need to ground their reasoning to real-world knowledge.
There remain challenges in fine-tuning LLM agents to invoke tools in multi-step reasoning problems.
We propose a new method for LLMs to better leverage tools in multi-step reasoning.
arXiv Detail & Related papers (2024-01-30T21:53:30Z)
- CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets [75.64181719386497]
We present CRAFT, a tool creation and retrieval framework for large language models (LLMs).
It creates toolsets specifically curated for the tasks and equips LLMs with a component that retrieves tools from these sets to enhance their capability to solve complex tasks.
Our method is designed to be flexible and offers a plug-and-play approach to adapt off-the-shelf LLMs to unseen domains and modalities, without any finetuning.
arXiv Detail & Related papers (2023-09-29T17:40:26Z)
- ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs [104.37772295581088]
Open-source large language models (LLMs), e.g., LLaMA, remain significantly limited in tool-use capabilities.
We introduce ToolLLM, a general tool-use framework encompassing data construction, model training, and evaluation.
We first present ToolBench, an instruction-tuning dataset for tool use, which is constructed automatically using ChatGPT.
arXiv Detail & Related papers (2023-07-31T15:56:53Z)
- ToolQA: A Dataset for LLM Question Answering with External Tools [14.408707186450899]
Large Language Models (LLMs) have demonstrated impressive performance in various NLP tasks.
They still suffer from challenges such as hallucination and weak numerical reasoning.
To overcome these challenges, external tools can be used to enhance LLMs' question-answering abilities.
arXiv Detail & Related papers (2023-06-23T05:43:28Z)