Related papers: ToolFuzz -- Automated Agent Tool Testing

ToolFuzz -- Automated Agent Tool Testing

URL: http://arxiv.org/abs/2503.04479v3
Date: Tue, 11 Mar 2025 14:28:13 GMT
Title: ToolFuzz -- Automated Agent Tool Testing
Authors: Ivan Milev, Mislav Balunović, Maximilian Baader, Martin Vechev,
Abstract summary: ToolFuzz is designed to discover two types of errors: (1) user queries leading to tool runtime errors and (2) user queries that lead to incorrect agent responses.<n>We show that ToolFuzz identifies 20x more erroneous inputs compared to the prompt-engineering approaches.
Score: 5.174808367448261
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Large Language Model (LLM) Agents leverage the advanced reasoning capabilities of LLMs in real-world applications. To interface with an environment, these agents often rely on tools, such as web search or database APIs. As the agent provides the LLM with tool documentation along the user query, the completeness and correctness of this documentation is critical. However, tool documentation is often over-, under-, or ill-specified, impeding the agent's accuracy. Standard software testing approaches struggle to identify these errors as they are expressed in natural language. Thus, despite its importance, there currently exists no automated method to test the tool documentation for agents. To address this issue, we present ToolFuzz, the first method for automated testing of tool documentations. ToolFuzz is designed to discover two types of errors: (1) user queries leading to tool runtime errors and (2) user queries that lead to incorrect agent responses. ToolFuzz can generate a large and diverse set of natural inputs, effectively finding tool description errors at a low false positive rate. Further, we present two straightforward prompt-engineering approaches. We evaluate all three tool testing approaches on 32 common LangChain tools and 35 newly created custom tools and 2 novel benchmarks to further strengthen the assessment. We find that many publicly available tools suffer from underspecification. Specifically, we show that ToolFuzz identifies 20x more erroneous inputs compared to the prompt-engineering approaches, making it a key component for building reliable AI agents.

Related papers

Prompt Injection Attack to Tool Selection in LLM Agents [74.90338504778781]
We introduce textitToolHijacker, a novel prompt injection attack targeting tool selection in no-box scenarios. ToolHijacker injects a malicious tool document into the tool library to manipulate the LLM agent's tool selection process. We show that ToolHijacker is highly effective, significantly outperforming existing manual-based and automated prompt injection attacks.
arXiv Detail & Related papers (2025-04-28T13:36:43Z)
A Framework for Testing and Adapting REST APIs as LLM Tools [5.758488787763118]
We present a novel testing framework aimed at evaluating and enhancing the readiness of REST APIs to function as tools for agents. Our framework transforms apis as tools, generates comprehensive test cases for the APIs, tests cases into natural language instructions and evaluates the agent's ability t correctly invoke the API and process its inputs and responses.
arXiv Detail & Related papers (2025-04-22T02:52:08Z)
Benchmarking Failures in Tool-Augmented Language Models [41.94295877935867]
Tool-augmented language models (TaLMs) assume 'perfect' information access and tool availability, which may not hold in the real world. We introduce the FAIL-TALMS benchmark, featuring two major failures: under-specified user queries and non-available tools. We evaluate top-performing proprietary and open-source models, and find all current models except for Claude struggle to recognize missing tools or information.
arXiv Detail & Related papers (2025-03-18T13:04:55Z)
Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger [49.81945268343162]
We propose MeCo, an adaptive decision-making strategy for external tool use. MeCo captures high-level cognitive signals in the representation space, guiding when to invoke tools. Our experiments show that MeCo accurately detects LLMs' internal cognitive signals and significantly improves tool-use decision-making.
arXiv Detail & Related papers (2025-02-18T15:45:01Z)
ToolFactory: Automating Tool Generation by Leveraging LLM to Understand REST API Documentations [4.934192277899036]
API documentation often suffers from a lack of standardization, inconsistent schemas, and incomplete information.<n>We developed textbfToolFactory, an open-source pipeline for automating tool generation from unstructured API documents.<n>We also demonstrated ToolFactory by creating a domain-specific AI agent for glycomaterials research.
arXiv Detail & Related papers (2025-01-28T13:42:33Z)
Learning to Ask: When LLM Agents Meet Unclear Instruction [55.65312637965779]
Large language models (LLMs) can leverage external tools for addressing a range of tasks unattainable through language skills alone.<n>We evaluate the performance of LLMs tool-use under imperfect instructions, analyze the error patterns, and build a challenging tool-use benchmark called Noisy ToolBench.<n>We propose a novel framework, Ask-when-Needed (AwN), which prompts LLMs to ask questions to users whenever they encounter obstacles due to unclear instructions.
arXiv Detail & Related papers (2024-08-31T23:06:12Z)
GTA: A Benchmark for General Tool Agents [32.443456248222695]
We design 229 real-world tasks and executable tool chains to evaluate mainstream large language models (LLMs) Our findings show that real-world user queries are challenging for existing LLMs, with GPT-4 completing less than 50% of the tasks and most LLMs achieving below 25%. This evaluation reveals the bottlenecks in the tool-use capabilities of current LLMs in real-world scenarios, which provides future direction for advancing general-purpose tool agents.
arXiv Detail & Related papers (2024-07-11T17:50:09Z)
Tools Fail: Detecting Silent Errors in Faulty Tools [27.822981272044043]
We introduce a framework for tools which guides us to explore a model's ability to detect "silent" tool. We provide an initial approach to failure recovery with promising results both on a controlled calculator setting and embodied agent planning.
arXiv Detail & Related papers (2024-06-27T14:52:34Z)
Tool Learning in the Wild: Empowering Language Models as Automatic Tool Agents [56.822238860147024]
Augmenting large language models with external tools has emerged as a promising approach to extend their utility.<n>Previous methods manually parse tool documentation and create in-context demonstrations, transforming tools into structured formats for LLMs to use in their step-by-step reasoning.<n>We propose AutoTools, a framework that enables LLMs to automate the tool-use workflow.
arXiv Detail & Related papers (2024-05-26T11:40:58Z)
EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction [56.02100384015907]
EasyTool is a framework transforming diverse and lengthy tool documentation into a unified and concise tool instruction. It can significantly reduce token consumption and improve the performance of tool utilization in real-world scenarios.
arXiv Detail & Related papers (2024-01-11T15:45:11Z)
ControlLLM: Augment Language Models with Tools by Searching on Graphs [97.62758830255002]
We present ControlLLM, a novel framework that enables large language models (LLMs) to utilize multi-modal tools for solving real-world tasks. Our framework comprises three key components: (1) a textittask decomposer that breaks down a complex task into clear subtasks with well-defined inputs and outputs; (2) a textitThoughts-on-Graph (ToG) paradigm that searches the optimal solution path on a pre-built tool graph; and (3) an textitexecution engine with a rich toolbox that interprets the solution path and runs the
arXiv Detail & Related papers (2023-10-26T21:57:21Z)
Don't Fine-Tune, Decode: Syntax Error-Free Tool Use via Constrained Decoding [11.51687663492722]
Large language models (LLMs) excel at many tasks but often fail to use external tools due to complicated and unfamiliar syntax constraints. We propose TOOLDEC, a decoding algorithm using finite state machines to force LLMs to follow tool syntax. Experiments show that TOOLDEC eliminates all syntax errors, achieving significantly better performance on various base models and benchmarks.
arXiv Detail & Related papers (2023-10-10T23:37:53Z)
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs [104.37772295581088]
Open-source large language models (LLMs), e.g., LLaMA, remain significantly limited in tool-use capabilities. We introduce ToolLLM, a general tool-usetuning encompassing data construction, model training, and evaluation. We first present ToolBench, an instruction-tuning framework for tool use, which is constructed automatically using ChatGPT.
arXiv Detail & Related papers (2023-07-31T15:56:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.