Related papers: Benchmarking Failures in Tool-Augmented Language Models

URL: http://arxiv.org/abs/2503.14227v1
Date: Tue, 18 Mar 2025 13:04:55 GMT
Title: Benchmarking Failures in Tool-Augmented Language Models
Authors: Eduardo Treviño, Hugo Contant, James Ngai, Graham Neubig, Zora Zhiruo Wang
Abstract summary: Tool-augmented language models (TaLMs) assume 'perfect' information access and tool availability, which may not hold in the real world. We introduce the FAIL-TALMS benchmark, featuring two major failures: under-specified user queries and non-available tools. We evaluate top-performing proprietary and open-source models, and find all current models except for Claude struggle to recognize missing tools or information.
Score: 41.94295877935867
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: The integration of tools has extended the capabilities of language models (LMs) beyond vanilla text generation to versatile scenarios. However, tool-augmented language models (TaLMs) often assume 'perfect' information access and tool availability, which may not hold in the real world. To systematically study TaLMs' imperfections, we introduce the FAIL-TALMS benchmark, featuring two major failures: under-specified user queries and non-available tools. FAIL-TALMS contains 1,749 examples using 906 tools across 21 categories, including single- and multi-tool usage. We evaluate top-performing proprietary and open-source models, and find all current models except for Claude struggle to recognize missing tools or information. Further, to study possible mitigation of the failures, we enable real-time human interaction, named the Ask-and-Help (AAH) method, to provide missing information or replace non-functional tools. While AAH can help models solve tasks more correctly when queries are under-specified, it brings minimal benefit when complex tools are broken.

Related papers

MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers [5.463884405989425]
We introduce MCP-Atlas, a large-scale benchmark for evaluating tool-use competency. It includes 1,000 tasks designed to assess tool-use competency in realistic, multi-step orchestration. We score tasks using a claims-based rubric that awards partial credit based on the factual claims satisfied in the model's final answer.
arXiv Detail & Related papers (2026-01-31T23:19:39Z)
AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning [66.24374176797075]
We introduce AdaReasoner, a family of multimodal models that learn tool use as a general reasoning skill rather than as tool-specific or explicitly supervised behavior. AdaReasoner is enabled by (i) a scalable data curation pipeline exposing models to long-horizon, multi-step tool interactions; (ii) Tool-GRPO, a reinforcement learning algorithm that prioritizes tool selection and sequencing based on end-task success; and (iii) an adaptive learning mechanism that dynamically regulates tool usage.
arXiv Detail & Related papers (2026-01-26T16:04:43Z)
Retrieval Models Aren't Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models [47.145844910856134]
Tool learning aims to augment large language models with diverse tools, enabling them to act as agents for solving practical tasks. Due to the limited context length of tool-using LLMs, adopting information retrieval (IR) models to select useful tools from large toolsets is a critical initial step. Most tool-use benchmarks simplify this step by manually pre-annotating a small set of relevant tools for each task, which is far from the real-world scenarios. We propose ToolRet, a heterogeneous tool retrieval benchmark comprising 7.6k diverse retrieval tasks, and a corpus of 43k tools, collected from
arXiv Detail & Related papers (2025-03-03T17:37:16Z)
Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger [49.81945268343162]
We propose MeCo, an adaptive decision-making strategy for external tool use. MeCo captures high-level cognitive signals in the representation space, guiding when to invoke tools. Our experiments show that MeCo accurately detects LLMs' internal cognitive signals and significantly improves tool-use decision-making.
arXiv Detail & Related papers (2025-02-18T15:45:01Z)
Learning to Ask: When LLM Agents Meet Unclear Instruction [55.65312637965779]
Large language models (LLMs) can leverage external tools for addressing a range of tasks unattainable through language skills alone. We evaluate the tool-use performance of LLMs under imperfect instructions, analyze the error patterns, and build a challenging tool-use benchmark called Noisy ToolBench. We propose a novel framework, Ask-when-Needed (AwN), which prompts LLMs to ask questions to users whenever they encounter obstacles due to unclear instructions.
arXiv Detail & Related papers (2024-08-31T23:06:12Z)
Enhancing Tool Retrieval with Iterative Feedback from Large Language Models [9.588592185027455]
Large language models (LLMs) can effectively handle a certain amount of tools through in-context learning or fine-tuning. In real-world scenarios, the number of tools is typically extensive and irregularly updated, emphasizing the necessity for a dedicated tool retrieval component. We propose to enhance tool retrieval with iterative feedback from the large language model.
arXiv Detail & Related papers (2024-06-25T11:12:01Z)
Can Tool-augmented Large Language Models be Aware of Incomplete Conditions? [33.74511128798095]
This study examines whether large language models can identify incomplete conditions and appropriately determine when to refrain from using tools. Our experiments show that LLMs often struggle to identify the absence of information required to utilize specific tools. Our research can contribute to advancing reliable LLMs by addressing common scenarios during interactions between humans and LLMs.
arXiv Detail & Related papers (2024-06-18T06:28:06Z)
Towards Completeness-Oriented Tool Retrieval for Large Language Models [60.733557487886635]
Real-world systems often incorporate a wide array of tools, making it impractical to input all tools into Large Language Models. Existing tool retrieval methods primarily focus on semantic matching between user queries and tool descriptions. We propose a novel model-agnostic COllaborative Learning-based Tool Retrieval approach, COLT, which captures not only the semantic similarities between user queries and tool descriptions but also takes into account the collaborative information of tools.
arXiv Detail & Related papers (2024-05-25T06:41:23Z)
MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use [79.87054552116443]
Large language models (LLMs) have garnered significant attention due to their impressive natural language processing (NLP) capabilities. We introduce MetaTool, a benchmark designed to evaluate whether LLMs have tool usage awareness and can correctly choose tools. We conduct experiments involving eight popular LLMs and find that the majority of them still struggle to effectively select tools.
arXiv Detail & Related papers (2023-10-04T19:39:26Z)
Large Language Models as Tool Makers [85.00361145117293]
We introduce a closed-loop framework, referred to as LLMs As Tool Makers (LATM), where LLMs create their own reusable tools for problem-solving. Our approach consists of two phases: 1) tool making: an LLM acts as the tool maker that crafts tools for a set of tasks. 2) tool using: another LLM acts as the tool user, which applies the tool built by the tool maker for problem-solving.
arXiv Detail & Related papers (2023-05-26T17:50:11Z)
Making Language Models Better Tool Learners with Execution Feedback [36.30542737293863]
Tools serve as pivotal interfaces that enable humans to understand and reshape the environment. Existing tool learning methodologies induce large language models to utilize tools indiscriminately. We propose Tool leaRning wIth exeCution fEedback (TRICE), a two-stage end-to-end framework that enables the model to continually learn through feedback derived from tool execution.
arXiv Detail & Related papers (2023-05-22T14:37:05Z)
TALM: Tool Augmented Language Models [28.483609366116525]
Transformer based language models (LMs) demonstrate increasing performance with scale across a wide variety of tasks. We present Tool Augmented Language Models (TALM), combining a text-only approach to augment language models with non-differentiable tools. TALM exhibits strong performance on both a knowledge-heavy QA task and a reasoning oriented math task with simple tools.
arXiv Detail & Related papers (2022-05-24T17:58:13Z)

This list is automatically generated from the titles and abstracts of the papers on this site.