Related papers: WTU-EVAL: A Whether-or-Not Tool Usage Evaluation Benchmark for Large Language Models

URL: http://arxiv.org/abs/2407.12823v1
Date: Tue, 2 Jul 2024 12:07:38 GMT
Title: WTU-EVAL: A Whether-or-Not Tool Usage Evaluation Benchmark for Large Language Models
Authors: Kangyun Ning, Yisong Su, Xueqiang Lv, Yuanzhe Zhang, Jian Liu, Kang Liu, Jinan Xu
Abstract summary: Large Language Models (LLMs) excel in NLP tasks, but still need external tools to extend their ability. We introduce the Whether-or-not tool usage Evaluation benchmark (WTU-Eval) to assess LLMs with eleven datasets. The results of eight LLMs on WTU-Eval reveal that LLMs frequently struggle to determine tool use in general datasets. Fine-tuning Llama2-7B results in a 14% average performance improvement and a 16.8% decrease in incorrect tool usage.
Score: 31.742620965039517
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Although Large Language Models (LLMs) excel in NLP tasks, they still need external tools to extend their ability. Current research on tool learning with LLMs often assumes mandatory tool use, which does not always align with real-world situations, where the necessity for tools is uncertain, and incorrect or unnecessary use of tools can damage the general abilities of LLMs. Therefore, we propose to explore whether LLMs can discern their ability boundaries and use tools flexibly. We then introduce the Whether-or-not tool usage Evaluation benchmark (WTU-Eval) to assess LLMs with eleven datasets, where six of them are tool-usage datasets, and five are general datasets. LLMs are prompted to use tools according to their needs. The results of eight LLMs on WTU-Eval reveal that LLMs frequently struggle to determine tool use in general datasets, and LLMs' performance in tool-usage datasets improves when their ability is similar to ChatGPT. In both datasets, incorrect tool usage significantly impairs LLMs' performance. To mitigate this, we also develop the finetuning dataset to enhance tool decision-making. Fine-tuning Llama2-7B results in a 14% average performance improvement and a 16.8% decrease in incorrect tool usage. We will release the WTU-Eval benchmark.

Related papers

Self-Training Large Language Models for Tool-Use Without Demonstrations [15.17750971071501]
Large language models (LLMs) remain prone to factual inaccuracies and computational errors. Recent work augmented LLMs with tools to mitigate these shortcomings, but often requires curated gold tool-use demonstrations. This paper investigates whether LLMs can learn to use tools without demonstrations.
arXiv Detail & Related papers (2025-02-09T12:06:10Z)
Tool Unlearning for Tool-Augmented LLMs [14.755831733659699]
Tool-augmented large language models (LLMs) are often trained on datasets of query-response pairs. ToolDelete is the first approach for unlearning tools from tool-augmented LLMs.
arXiv Detail & Related papers (2025-02-03T05:50:55Z)
Learning to Ask: When LLMs Meet Unclear Instruction [49.256630152684764]
Large language models (LLMs) can leverage external tools for addressing a range of tasks unattainable through language skills alone. We evaluate the performance of LLMs' tool use under imperfect instructions, analyze the error patterns, and build a challenging tool-use benchmark called Noisy ToolBench. We propose a novel framework, Ask-when-Needed (AwN), which prompts LLMs to ask questions to users whenever they encounter obstacles due to unclear instructions.
arXiv Detail & Related papers (2024-08-31T23:06:12Z)
Can Tool-augmented Large Language Models be Aware of Incomplete Conditions? [33.74511128798095]
This study examines whether large language models can identify incomplete conditions and appropriately determine when to refrain from using tools. We confirm that most LLMs are challenged to identify the additional information required to utilize specific tools and the absence of appropriate tools.
arXiv Detail & Related papers (2024-06-18T06:28:06Z)
Chain of Tools: Large Language Model is an Automatic Multi-tool Learner [54.992464510992605]
Automatic Tool Chain (ATC) is a framework that enables the large language models (LLMs) to act as a multi-tool user. To scale up the scope of the tools, we next propose a black-box probing method. For a comprehensive evaluation, we build a challenging benchmark named ToolFlow.
arXiv Detail & Related papers (2024-05-26T11:40:58Z)
Towards Practical Tool Usage for Continually Learning LLMs [28.62382804829694]
Large language models show an innate skill for solving language based tasks. But their knowledge, stored directly within their parameters, remains static in time. Tool use helps by offloading work to systems that the LLM can access through an interface. But LLMs that use them still must adapt to nonstationary environments for prolonged use.
arXiv Detail & Related papers (2024-04-14T19:45:47Z)
LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error [54.954211216847135]
Existing large language models (LLMs) only reach a correctness rate in the range of 30% to 60%. We propose a biologically inspired method for tool-augmented LLMs, simulated trial and error (STE). STE orchestrates three key mechanisms for successful tool use behaviors in the biological system: trial and error, imagination, and memory.
arXiv Detail & Related papers (2024-03-07T18:50:51Z)
Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios [93.68764280953624]
UltraTool is a novel benchmark designed to improve and evaluate Large Language Models' ability in tool utilization. It emphasizes real-world complexities, demanding accurate, multi-step planning for effective problem-solving. A key feature of UltraTool is its independent evaluation of planning with natural language, which happens before tool usage.
arXiv Detail & Related papers (2024-01-30T16:52:56Z)
ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios [49.33633818046644]
We propose ToolEyes, a fine-grained system tailored for the evaluation of large language models' tool learning capabilities in authentic scenarios. The system meticulously examines seven real-world scenarios, analyzing five dimensions crucial to LLMs in tool learning. ToolEyes incorporates a tool library boasting approximately 600 tools, serving as an intermediary between LLMs and the physical world.
arXiv Detail & Related papers (2024-01-01T12:49:36Z)
MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use [82.24774504584066]
Large language models (LLMs) have garnered significant attention due to their impressive natural language processing (NLP) capabilities. We introduce MetaTool, a benchmark designed to evaluate whether LLMs have tool usage awareness and can correctly choose tools. We conduct experiments involving eight popular LLMs and find that the majority of them still struggle to effectively select tools.
arXiv Detail & Related papers (2023-10-04T19:39:26Z)
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction [41.36474802204914]
GPT4Tools is based on self-instruct to enable open-source LLMs, such as LLaMA and OPT, to use tools. It generates an instruction-following dataset by prompting an advanced teacher with various multi-modal contexts.
arXiv Detail & Related papers (2023-05-30T05:27:21Z)
Large Language Models as Tool Makers [85.00361145117293]
We introduce a closed-loop framework, referred to as LLMs As Tool Makers (LATM), where LLMs create their own reusable tools for problem-solving. Our approach consists of two phases: 1) tool making: an LLM acts as the tool maker that crafts tools for a set of tasks. 2) tool using: another LLM acts as the tool user, which applies the tool built by the tool maker for problem-solving.
arXiv Detail & Related papers (2023-05-26T17:50:11Z)