ToolTalk: Evaluating Tool-Usage in a Conversational Setting
- URL: http://arxiv.org/abs/2311.10775v1
- Date: Wed, 15 Nov 2023 23:50:31 GMT
- Title: ToolTalk: Evaluating Tool-Usage in a Conversational Setting
- Authors: Nicholas Farn and Richard Shin
- Abstract summary: This paper introduces ToolTalk, a benchmark consisting of complex user intents requiring multi-step tool usage specified through dialogue.
ToolTalk contains 28 tools grouped into 7 plugins, and includes a complete simulated implementation of each tool.
We evaluate GPT-3.5 and GPT-4 on ToolTalk resulting in success rates of 26% and 50% respectively.
- Score: 6.792842055445584
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have displayed massive improvements in reasoning
and decision-making skills and can hold natural conversations with users. Many
recent works seek to augment LLM-based assistants with external tools so they
can access private or up-to-date information and carry out actions on behalf of
users. To better measure the performance of these assistants, this paper
introduces ToolTalk, a benchmark consisting of complex user intents requiring
multi-step tool usage specified through dialogue. ToolTalk contains 28 tools
grouped into 7 plugins, and includes a complete simulated implementation of
each tool, allowing for fully automated evaluation of assistants that rely on
execution feedback. ToolTalk also emphasizes tools that externally affect the
world rather than only tools for referencing or searching information. We
evaluate GPT-3.5 and GPT-4 on ToolTalk resulting in success rates of 26% and
50% respectively. Our analysis of the errors reveals three major categories and
suggests some future directions for improvement. We release ToolTalk at
https://github.com/microsoft/ToolTalk.
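The abstract notes that each tool ships with a complete simulated implementation so assistants that rely on execution feedback can be scored automatically. Below is a minimal, hypothetical sketch of that idea, not the actual ToolTalk API: a simulated "action" tool that mutates an in-memory store and returns feedback, plus a turn-level check that compares an assistant's predicted tool calls against ground-truth calls. All names (SimulatedEmailTool, turn_success, etc.) are illustrative assumptions.

```python
# Hypothetical sketch (not the real ToolTalk code): simulated tool execution
# plus automated comparison of predicted vs. ground-truth tool calls.
from dataclasses import dataclass


@dataclass
class ToolCall:
    tool: str   # tool name, e.g. "SendEmail" (illustrative)
    args: dict  # keyword arguments supplied by the assistant


class SimulatedEmailTool:
    """A simulated action tool: it writes to an in-memory database instead of
    the real world, so evaluation needs no external services."""
    name = "SendEmail"
    is_action = True  # externally affects the world, unlike a pure search tool

    def __init__(self, database: dict):
        self.database = database

    def __call__(self, recipient: str, subject: str, body: str) -> dict:
        self.database.setdefault("sent_emails", []).append(
            {"recipient": recipient, "subject": subject, "body": body}
        )
        return {"status": "sent"}  # execution feedback returned to the assistant


def turn_success(predicted: list[ToolCall], expected: list[ToolCall]) -> bool:
    """Count a turn as successful only if every ground-truth call is reproduced
    with identical arguments (a simplified success criterion)."""
    predicted_set = {(c.tool, tuple(sorted(c.args.items()))) for c in predicted}
    return all(
        (c.tool, tuple(sorted(c.args.items()))) in predicted_set for c in expected
    )


if __name__ == "__main__":
    db = {}
    send_email = SimulatedEmailTool(db)
    feedback = send_email(recipient="alice@example.com",
                          subject="Meeting", body="Moved to 3pm.")
    gold = [ToolCall("SendEmail", {"recipient": "alice@example.com",
                                   "subject": "Meeting", "body": "Moved to 3pm."})]
    print(feedback, turn_success(gold, gold))
```

Under these assumptions, a conversation-level success rate (the 26% / 50% figures reported for GPT-3.5 and GPT-4) would follow from requiring every turn in a dialogue to pass such a check.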
Related papers
- OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning [47.51937366171448]
OctoTools is a training-free, user-friendly, and easily extensible open-source agentic framework designed to tackle complex reasoning across diverse domains.
We validate OctoTools' generality across 16 diverse tasks, achieving substantial average accuracy gains of 9.3% over GPT-4o.
OctoTools outperforms AutoGen, GPT-Functions and LangChain by up to 10.6% when given the same set of tools.
arXiv Detail & Related papers (2025-02-16T21:18:47Z) - ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models in Multi-Hop Tool Use [51.43211624452462]
We present ToolHop, a dataset comprising 995 user queries and 3,912 associated tools.
ToolHop ensures diverse queries, meaningful interdependencies, locally executable tools, detailed feedback, and verifiable answers.
We evaluate 14 LLMs across five model families, uncovering significant challenges in handling multi-hop tool-use scenarios.
arXiv Detail & Related papers (2025-01-05T11:06:55Z) - Enhancing Tool Retrieval with Iterative Feedback from Large Language Models [9.588592185027455]
Large language models (LLMs) can effectively handle a certain number of tools through in-context learning or fine-tuning.
In real-world scenarios, the number of tools is typically extensive and irregularly updated, emphasizing the necessity for a dedicated tool retrieval component.
We propose to enhance tool retrieval with iterative feedback from the large language model.
arXiv Detail & Related papers (2024-06-25T11:12:01Z) - TOOLVERIFIER: Generalization to New Tools via Self-Verification [69.85190990517184]
We introduce a self-verification method which distinguishes between close candidates by self-asking contrastive questions during tool selection.
Experiments on 4 tasks from the ToolBench benchmark, consisting of 17 unseen tools, demonstrate an average improvement of 22% over few-shot baselines.
arXiv Detail & Related papers (2024-02-21T22:41:38Z) - MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use [79.87054552116443]
Large language models (LLMs) have garnered significant attention due to their impressive natural language processing (NLP) capabilities.
We introduce MetaTool, a benchmark designed to evaluate whether LLMs have tool usage awareness and can correctly choose tools.
We conduct experiments involving eight popular LLMs and find that the majority of them still struggle to effectively select tools.
arXiv Detail & Related papers (2023-10-04T19:39:26Z) - ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs [104.37772295581088]
Open-source large language models (LLMs), e.g., LLaMA, remain significantly limited in tool-use capabilities.
We introduce ToolLLM, a general tool-use framework encompassing data construction, model training, and evaluation.
We first present ToolBench, an instruction-tuning dataset for tool use, which is constructed automatically using ChatGPT.
arXiv Detail & Related papers (2023-07-31T15:56:53Z) - ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings [25.5476046472217]
Augmenting large language models with external tools has emerged as a promising approach to solving complex problems.
The recent in-context learning paradigm alleviates these issues, but the limited context length only allows for a few demonstrations.
We propose an alternative approach, ToolkenGPT, which combines the benefits of both sides.
arXiv Detail & Related papers (2023-05-19T09:54:21Z) - API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs [84.45284695156771]
API-Bank is a groundbreaking benchmark for tool-augmented Large Language Models.
We develop a runnable evaluation system consisting of 73 API tools.
We construct a comprehensive training set containing 1,888 tool-use dialogues from 2,138 APIs spanning 1,000 distinct domains.
arXiv Detail & Related papers (2023-04-14T14:05:32Z)