ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of
Large Language Models in Real-world Scenarios
- URL: http://arxiv.org/abs/2401.00741v2
- Date: Sun, 14 Jan 2024 15:06:18 GMT
- Title: ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of
Large Language Models in Real-world Scenarios
- Authors: Junjie Ye, Guanyu Li, Songyang Gao, Caishuang Huang, Yilong Wu, Sixian
Li, Xiaoran Fan, Shihan Dou, Qi Zhang, Tao Gui, Xuanjing Huang
- Abstract summary: We propose ToolEyes, a fine-grained system tailored for the evaluation of large language models' tool learning capabilities in authentic scenarios.
The system meticulously examines seven real-world scenarios, analyzing five dimensions crucial to LLMs in tool learning.
ToolEyes incorporates a tool library boasting approximately 600 tools, serving as an intermediary between LLMs and the physical world.
- Score: 48.38419686697733
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing evaluations of tool learning primarily focus on validating the
alignment of selected tools for large language models (LLMs) with expected
outcomes. However, these approaches rely on a limited set of scenarios where
answers can be pre-determined, diverging from genuine needs. Furthermore, a
sole emphasis on outcomes disregards the intricate capabilities essential for
LLMs to effectively utilize tools. To tackle this issue, we propose ToolEyes, a
fine-grained system tailored for the evaluation of the LLMs' tool learning
capabilities in authentic scenarios. The system meticulously examines seven
real-world scenarios, analyzing five dimensions crucial to LLMs in tool
learning: format alignment, intent comprehension, behavior planning, tool
selection, and answer organization. Additionally, ToolEyes incorporates a tool
library boasting approximately 600 tools, serving as an intermediary between
LLMs and the physical world. Evaluations involving ten LLMs across three
categories reveal a preference for specific scenarios and limited cognitive
abilities in tool learning. Intriguingly, expanding the model size even
exacerbates the hindrance to tool learning. These findings offer instructive
insights aimed at advancing the field of tool learning. The data is available
att https://github.com/Junjie-Ye/ToolEyes.
Related papers
- What Affects the Stability of Tool Learning? An Empirical Study on the Robustness of Tool Learning Frameworks [33.51887014808227]
This paper explores the impact of both internal and external factors on the performance of tool learning frameworks.
We find several insightful conclusions for future work, including the observation that LLMs can benefit significantly from increased trial and exploration.
arXiv Detail & Related papers (2024-07-03T11:06:05Z) - Tool Learning with Large Language Models: A Survey [60.733557487886635]
Tool learning with large language models (LLMs) has emerged as a promising paradigm for augmenting the capabilities of LLMs to tackle highly complex problems.
Despite growing attention and rapid advancements in this field, the existing literature remains fragmented and lacks systematic organization.
arXiv Detail & Related papers (2024-05-28T08:01:26Z) - Chain of Tools: Large Language Model is an Automatic Multi-tool Learner [54.992464510992605]
Automatic Tool Chain (ATC) is a framework that enables the large language models (LLMs) to act as a multi-tool user.
To scale up the scope of the tools, we next propose a black-box probing method.
For a comprehensive evaluation, we build a challenging benchmark named ToolFlow.
arXiv Detail & Related papers (2024-05-26T11:40:58Z) - Towards Completeness-Oriented Tool Retrieval for Large Language Models [60.733557487886635]
Real-world systems often incorporate a wide array of tools, making it impractical to input all tools into Large Language Models.
Existing tool retrieval methods primarily focus on semantic matching between user queries and tool descriptions.
We propose a novel modelagnostic COllaborative Learning-based Tool Retrieval approach, COLT, which captures not only the semantic similarities between user queries and tool descriptions but also takes into account the collaborative information of tools.
arXiv Detail & Related papers (2024-05-25T06:41:23Z) - LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error [54.954211216847135]
Existing large language models (LLMs) only reach a correctness rate in the range of 30% to 60%.
We propose a biologically inspired method for tool-augmented LLMs, simulated trial and error (STE)
STE orchestrates three key mechanisms for successful tool use behaviors in the biological system: trial and error, imagination, and memory.
arXiv Detail & Related papers (2024-03-07T18:50:51Z) - Look Before You Leap: Towards Decision-Aware and Generalizable Tool-Usage for Large Language Models [26.28459880766842]
We propose a decision-aware and generalizable tool-usage framework (DEER)
Specifically, we first construct the tool-usage samples with multiple decision branches via an automatic generation pipeline.
Our proposed DEER is effective and significantly outperforms baselines across various datasets.
arXiv Detail & Related papers (2024-02-26T16:11:03Z) - MetaTool Benchmark for Large Language Models: Deciding Whether to Use
Tools and Which to Use [82.24774504584066]
Large language models (LLMs) have garnered significant attention due to their impressive natural language processing (NLP) capabilities.
We introduce MetaTool, a benchmark designed to evaluate whether LLMs have tool usage awareness and can correctly choose tools.
We conduct experiments involving eight popular LLMs and find that the majority of them still struggle to effectively select tools.
arXiv Detail & Related papers (2023-10-04T19:39:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.