Related papers: Can Tool-augmented Large Language Models be Aware of Incomplete Conditions?

URL: http://arxiv.org/abs/2406.12307v2
Date: Sun, 29 Sep 2024 05:11:45 GMT
Title: Can Tool-augmented Large Language Models be Aware of Incomplete Conditions?
Authors: Seungbin Yang, ChaeHun Park, Taehee Kim, Jaegul Choo
Abstract summary: This study examines whether large language models can identify incomplete conditions and appropriately determine when to refrain from using tools. We confirm that most LLMs are challenged to identify the additional information required to utilize specific tools and the absence of appropriate tools.
Score: 33.74511128798095
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Recent advancements in integrating large language models (LLMs) with tools have allowed the models to interact with real-world environments. However, these tool-augmented LLMs often encounter incomplete scenarios when users provide partial information or the necessary tools are unavailable. Recognizing and managing such scenarios is crucial for LLMs to ensure their reliability, but this exploration remains understudied. This study examines whether LLMs can identify incomplete conditions and appropriately determine when to refrain from using tools. To this end, we construct a dataset by manipulating instances from two existing datasets, removing either the necessary tools or the essential information for tool invocation. We confirm that most LLMs are challenged to identify the additional information required to utilize specific tools and the absence of appropriate tools. We further analyze model behaviors in different environments and compare their performance against humans. Our research can contribute to advancing reliable LLMs by addressing scenarios that commonly arise during interactions between humans and LLMs.

Related papers

FamilyTool: A Multi-hop Personalized Tool Use Benchmark [93.80355496575281]
FamilyTool is a benchmark grounded in a family-based knowledge graph (KG) that simulates personalized, multi-hop tool use scenarios. Experiments reveal significant performance gaps in state-of-the-art Large Language Models (LLMs). FamilyTool serves as a critical resource for evaluating and advancing LLM agents' reasoning, adaptability, and scalability in complex, dynamic environments.
arXiv Detail & Related papers (2025-04-09T10:42:36Z)
Benchmarking Failures in Tool-Augmented Language Models [41.94295877935867]
Tool-augmented language models (TaLMs) assume 'perfect' information access and tool availability, which may not hold in the real world. We introduce the FAIL-TALMS benchmark, featuring two major failures: under-specified user queries and non-available tools. We evaluate top-performing proprietary and open-source models, and find all current models except for Claude struggle to recognize missing tools or information.
arXiv Detail & Related papers (2025-03-18T13:04:55Z)
Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger [49.81945268343162]
We propose MeCo, an adaptive decision-making strategy for external tool use. MeCo captures high-level cognitive signals in the representation space, guiding when to invoke tools. Our experiments show that MeCo accurately detects LLMs' internal cognitive signals and significantly improves tool-use decision-making.
arXiv Detail & Related papers (2025-02-18T15:45:01Z)
Self-Training Large Language Models for Tool-Use Without Demonstrations [15.17750971071501]
Large language models (LLMs) remain prone to factual inaccuracies and computational errors. Recent work augmented LLMs with tools to mitigate these shortcomings, but often requires curated gold tool-use demonstrations. This paper investigates whether LLMs can learn to use tools without demonstrations.
arXiv Detail & Related papers (2025-02-09T12:06:10Z)
Tool Unlearning for Tool-Augmented LLMs [14.755831733659699]
Tool-augmented large language models (LLMs) are often trained on datasets of query-response pairs. ToolDelete is the first approach for unlearning tools from tool-augmented LLMs.
arXiv Detail & Related papers (2025-02-03T05:50:55Z)
ACEBench: Who Wins the Match Point in Tool Usage? [68.54159348899891]
ACEBench is a comprehensive benchmark for assessing tool usage in Large Language Models (LLMs). It categorizes data into three primary types based on evaluation methodology: Normal, Special, and Agent. It provides a more granular examination of error causes across different data types.
arXiv Detail & Related papers (2025-01-22T12:59:08Z)
Learning to Ask: When LLMs Meet Unclear Instruction [49.256630152684764]
Large language models (LLMs) can leverage external tools for addressing a range of tasks unattainable through language skills alone. We evaluate the performance of LLMs' tool use under imperfect instructions, analyze the error patterns, and build a challenging tool-use benchmark called Noisy ToolBench. We propose a novel framework, Ask-when-Needed (AwN), which prompts LLMs to ask questions to users whenever they encounter obstacles due to unclear instructions.
arXiv Detail & Related papers (2024-08-31T23:06:12Z)
WTU-EVAL: A Whether-or-Not Tool Usage Evaluation Benchmark for Large Language Models [31.742620965039517]
Large Language Models (LLMs) excel in NLP tasks, but still need external tools to extend their ability. We introduce the Whether-or-not tool usage Evaluation benchmark (WTU-Eval) to assess LLMs with eleven datasets. The results of eight LLMs on WTU-Eval reveal that LLMs frequently struggle to determine tool use in general datasets. Fine-tuning Llama2-7B results in a 14% average performance improvement and a 16.8% decrease in incorrect tool usage.
arXiv Detail & Related papers (2024-07-02T12:07:38Z)
Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases. We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning. Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
Tool Learning in the Wild: Empowering Language Models as Automatic Tool Agents [56.822238860147024]
Augmenting large language models with external tools has emerged as a promising approach to extend their utility. Previous methods manually parse tool documentation and create in-context demonstrations, transforming tools into structured formats for LLMs to use in their step-by-step reasoning. We propose AutoTools, a framework that enables LLMs to automate the tool-use workflow.
arXiv Detail & Related papers (2024-05-26T11:40:58Z)
Towards Completeness-Oriented Tool Retrieval for Large Language Models [60.733557487886635]
Real-world systems often incorporate a wide array of tools, making it impractical to input all tools into Large Language Models. Existing tool retrieval methods primarily focus on semantic matching between user queries and tool descriptions. We propose a novel model-agnostic COllaborative Learning-based Tool Retrieval approach, COLT, which captures not only the semantic similarities between user queries and tool descriptions but also takes into account the collaborative information of tools.
arXiv Detail & Related papers (2024-05-25T06:41:23Z)
Towards Practical Tool Usage for Continually Learning LLMs [28.62382804829694]
Large language models show an innate skill for solving language-based tasks. But their knowledge, stored directly within their parameters, remains static in time. Tool use helps by offloading work to systems that the LLM can access through an interface. But LLMs that use tools must still adapt to nonstationary environments for prolonged use.
arXiv Detail & Related papers (2024-04-14T19:45:47Z)
Look Before You Leap: Towards Decision-Aware and Generalizable Tool-Usage for Large Language Models [26.28459880766842]
We propose a decision-aware and generalizable tool-usage framework (DEER). Specifically, we first construct the tool-usage samples with multiple decision branches via an automatic generation pipeline. Our proposed DEER is effective and significantly outperforms baselines across various datasets.
arXiv Detail & Related papers (2024-02-26T16:11:03Z)
Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios [93.68764280953624]
UltraTool is a novel benchmark designed to improve and evaluate Large Language Models' ability in tool utilization. It emphasizes real-world complexities, demanding accurate, multi-step planning for effective problem-solving. A key feature of UltraTool is its independent evaluation of planning with natural language, which happens before tool usage.
arXiv Detail & Related papers (2024-01-30T16:52:56Z)
EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction [56.02100384015907]
EasyTool is a framework transforming diverse and lengthy tool documentation into a unified and concise tool instruction. It can significantly reduce token consumption and improve the performance of tool utilization in real-world scenarios.
arXiv Detail & Related papers (2024-01-11T15:45:11Z)
ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios [48.38419686697733]
We propose ToolEyes, a fine-grained system tailored for the evaluation of large language models' tool learning capabilities in authentic scenarios. The system meticulously examines seven real-world scenarios, analyzing five dimensions crucial to LLMs in tool learning. ToolEyes incorporates a tool library boasting approximately 600 tools, serving as an intermediary between LLMs and the physical world.
arXiv Detail & Related papers (2024-01-01T12:49:36Z)
MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use [82.24774504584066]
Large language models (LLMs) have garnered significant attention due to their impressive natural language processing (NLP) capabilities. We introduce MetaTool, a benchmark designed to evaluate whether LLMs have tool usage awareness and can correctly choose tools. We conduct experiments involving eight popular LLMs and find that the majority of them still struggle to effectively select tools.
arXiv Detail & Related papers (2023-10-04T19:39:26Z)

This list is automatically generated from the titles and abstracts of the papers on this site.