ToolTalk: Evaluating Tool-Usage in a Conversational Setting
- URL: http://arxiv.org/abs/2311.10775v1
- Date: Wed, 15 Nov 2023 23:50:31 GMT
- Title: ToolTalk: Evaluating Tool-Usage in a Conversational Setting
- Authors: Nicholas Farn and Richard Shin
- Abstract summary: This paper introduces ToolTalk, a benchmark consisting of complex user intents requiring multi-step tool usage specified through dialogue.
ToolTalk contains 28 tools grouped into 7 plugins, and includes a complete simulated implementation of each tool.
We evaluate GPT-3.5 and GPT-4 on ToolTalk resulting in success rates of 26% and 50% respectively.
- Score: 6.792842055445584
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have displayed massive improvements in reasoning
and decision-making skills and can hold natural conversations with users. Many
recent works seek to augment LLM-based assistants with external tools so they
can access private or up-to-date information and carry out actions on behalf of
users. To better measure the performance of these assistants, this paper
introduces ToolTalk, a benchmark consisting of complex user intents requiring
multi-step tool usage specified through dialogue. ToolTalk contains 28 tools
grouped into 7 plugins, and includes a complete simulated implementation of
each tool, allowing for fully automated evaluation of assistants that rely on
execution feedback. ToolTalk also emphasizes tools that externally affect the
world rather than only tools for referencing or searching information. We
evaluate GPT-3.5 and GPT-4 on ToolTalk resulting in success rates of 26% and
50% respectively. Our analysis of the errors reveals three major categories and
suggests some future directions for improvement. We release ToolTalk at
https://github.com/microsoft/ToolTalk.
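The abstract notes that each tool ships with a complete simulated implementation so assistants that rely on execution feedback can be scored automatically. Below is a minimal, hypothetical sketch of that idea, not the actual ToolTalk API: a simulated "action" tool that mutates an in-memory store and returns feedback, plus a turn-level check that compares an assistant's predicted tool calls against ground-truth calls. All names (SimulatedEmailTool, turn_success, etc.) are illustrative assumptions.

```python
# Hypothetical sketch (not the real ToolTalk code): simulated tool execution
# plus automated comparison of predicted vs. ground-truth tool calls.
from dataclasses import dataclass


@dataclass
class ToolCall:
    tool: str   # tool name, e.g. "SendEmail" (illustrative)
    args: dict  # keyword arguments supplied by the assistant


class SimulatedEmailTool:
    """A simulated action tool: it writes to an in-memory database instead of
    the real world, so evaluation needs no external services."""
    name = "SendEmail"
    is_action = True  # externally affects the world, unlike a pure search tool

    def __init__(self, database: dict):
        self.database = database

    def __call__(self, recipient: str, subject: str, body: str) -> dict:
        self.database.setdefault("sent_emails", []).append(
            {"recipient": recipient, "subject": subject, "body": body}
        )
        return {"status": "sent"}  # execution feedback returned to the assistant


def turn_success(predicted: list[ToolCall], expected: list[ToolCall]) -> bool:
    """Count a turn as successful only if every ground-truth call is reproduced
    with identical arguments (a simplified success criterion)."""
    predicted_set = {(c.tool, tuple(sorted(c.args.items()))) for c in predicted}
    return all(
        (c.tool, tuple(sorted(c.args.items()))) in predicted_set for c in expected
    )


if __name__ == "__main__":
    db = {}
    send_email = SimulatedEmailTool(db)
    feedback = send_email(recipient="alice@example.com",
                          subject="Meeting", body="Moved to 3pm.")
    gold = [ToolCall("SendEmail", {"recipient": "alice@example.com",
                                   "subject": "Meeting", "body": "Moved to 3pm."})]
    print(feedback, turn_success(gold, gold))
```

Under these assumptions, a conversation-level success rate (the 26% / 50% figures reported for GPT-3.5 and GPT-4) would follow from requiring every turn in a dialogue to pass such a check.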
Related papers
- OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning [47.51937366171448]
OctoTools is a training-free, user-friendly, and easily extensible open-source agentic framework designed to tackle complex reasoning across diverse domains.
We validate OctoTools' generality across 16 diverse tasks, achieving substantial average accuracy gains of 9.3% over GPT-4o.
OctoTools outperforms AutoGen, GPT-Functions and LangChain by up to 10.6% when given the same set of tools.
arXiv Detail & Related papers (2025-02-16T21:18:47Z) - ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models in Multi-Hop Tool Use [51.43211624452462]
We present ToolHop, a dataset comprising 995 user queries and 3,912 associated tools.
ToolHop ensures diverse queries, meaningful interdependencies, locally executable tools, detailed feedback, and verifiable answers.
We evaluate 14 LLMs across five model families, uncovering significant challenges in handling multi-hop tool-use scenarios.
arXiv Detail & Related papers (2025-01-05T11:06:55Z) - Enhancing Tool Retrieval with Iterative Feedback from Large Language Models [9.588592185027455]
Large language models (LLMs) can effectively handle a certain number of tools through in-context learning or fine-tuning.
In real-world scenarios, the number of tools is typically extensive and irregularly updated, emphasizing the necessity for a dedicated tool retrieval component.
We propose to enhance tool retrieval with iterative feedback from the large language model.
arXiv Detail & Related papers (2024-06-25T11:12:01Z) - TOOLVERIFIER: Generalization to New Tools via Self-Verification [69.85190990517184]
We introduce a self-verification method which distinguishes between close candidates by self-asking contrastive questions during tool selection.
Experiments on 4 tasks from the ToolBench benchmark, consisting of 17 unseen tools, demonstrate an average improvement of 22% over few-shot baselines.
arXiv Detail & Related papers (2024-02-21T22:41:38Z) - MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use [79.87054552116443]
Large language models (LLMs) have garnered significant attention due to their impressive natural language processing (NLP) capabilities.
We introduce MetaTool, a benchmark designed to evaluate whether LLMs have tool usage awareness and can correctly choose tools.
We conduct experiments involving eight popular LLMs and find that the majority of them still struggle to effectively select tools.
arXiv Detail & Related papers (2023-10-04T19:39:26Z) - ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs [104.37772295581088]
Open-source large language models (LLMs), e.g., LLaMA, remain significantly limited in tool-use capabilities.
We introduce ToolLLM, a general tool-use framework encompassing data construction, model training, and evaluation.
We first present ToolBench, an instruction-tuning dataset for tool use, which is constructed automatically using ChatGPT.
arXiv Detail & Related papers (2023-07-31T15:56:53Z) - ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings [25.5476046472217]
Augmenting large language models with external tools has emerged as a promising approach to solving complex problems.
The recent in-context learning paradigm alleviates these issues, but the limited context length only allows for a few demonstrations.
We propose an alternative approach, ToolkenGPT, which combines the benefits of both sides.
arXiv Detail & Related papers (2023-05-19T09:54:21Z) - API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs [84.45284695156771]
API-Bank is a groundbreaking benchmark for tool-augmented Large Language Models.
We develop a runnable evaluation system consisting of 73 API tools.
We construct a comprehensive training set containing 1,888 tool-use dialogues from 2,138 APIs spanning 1,000 distinct domains.
arXiv Detail & Related papers (2023-04-14T14:05:32Z)