OpaqueToolsBench: Learning Nuances of Tool Behavior Through Interaction
- URL: http://arxiv.org/abs/2602.15197v1
- Date: Mon, 16 Feb 2026 21:26:37 GMT
- Title: OpaqueToolsBench: Learning Nuances of Tool Behavior Through Interaction
- Authors: Skyler Hallinan, Thejas Venkatesh, Xiang Ren, Sai Praneeth Karimireddy, Ashwin Paranjape, Yuhao Zhang, Jack Hessel
- Abstract summary: Tool-calling is essential for Large Language Model (LLM) agents to complete real-world tasks. While most existing benchmarks assume simple, perfectly documented tools, real-world tools are often opaque, lacking clear best practices or failure modes. We propose ToolObserver, a framework that iteratively refines tool documentation by observing execution feedback from tool-calling trajectories.
- Score: 41.38214226411103
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tool-calling is essential for Large Language Model (LLM) agents to complete real-world tasks. While most existing benchmarks assume simple, perfectly documented tools, real-world tools (e.g., general "search" APIs) are often opaque, lacking clear best practices or failure modes. Can LLM agents improve their performance in environments with opaque tools by interacting and subsequently improving documentation? To study this, we create OpaqueToolsBench, a benchmark consisting of three distinct task-oriented environments: general function calling, interactive chess playing, and long-trajectory agentic search. Each environment provides underspecified tools that models must learn to use effectively to complete the task. Results on OpaqueToolsBench suggest existing methods for automatically documenting tools are expensive and unreliable when tools are opaque. To address this, we propose a simple framework, ToolObserver, that iteratively refines tool documentation by observing execution feedback from tool-calling trajectories. Our approach outperforms existing methods on OpaqueToolsBench across datasets, even in relatively hard settings. Furthermore, for test-time tool exploration settings, our method is also efficient, consuming 3.5-7.5x fewer total tokens than the best baseline.
Related papers
- Learning to Rewrite Tool Descriptions for Reliable LLM-Agent Tool Use [21.666294374943178]
We propose a curriculum learning framework that transfers supervision from trace-rich settings to trace-free deployment. Experiments show consistent gains on unseen tools, strong cross-domain generalization, and robustness as the number of candidate tools scales to over 100.
arXiv Detail & Related papers (2026-02-23T23:50:24Z) - ToolTok: Tool Tokenization for Efficient and Generalizable GUI Agents [16.06309106596998]
ToolTok is a novel paradigm of multi-step pathfinding for GUI agents. We devise tools aligned with human interaction habits and represent each tool using learnable token embeddings. We construct an easy-to-hard curriculum consisting of three tasks: token definition question-answering, pure text-guided tool selection, and simplified visual pathfinding.
arXiv Detail & Related papers (2026-01-30T08:38:05Z) - Teaching LLMs to Learn Tool Trialing and Execution through Environment Interaction [31.689383152872534]
ToolMaster is a framework that shifts tool use from imitating golden tool-calling trajectories to actively learning tool usage through interaction with the environment. To optimize LLMs for tool planning and invocation, ToolMaster adopts a trial-and-execution paradigm. Experimental results demonstrate that ToolMaster significantly outperforms existing baselines in terms of generalization and robustness across unseen or unfamiliar tools.
arXiv Detail & Related papers (2026-01-19T06:46:33Z) - TheMCPCompany: Creating General-purpose Agents with Task-specific Tools [12.249551019598442]
We introduce TheMCPCompany, a benchmark for evaluating tool-calling agents on tasks that involve interacting with various real-world services. We also provide manually annotated ground-truth tools for each task. Overall, our work shows that the most advanced reasoning models are effective at discovering tools in simpler environments, but seriously struggle with navigating complex enterprise environments.
arXiv Detail & Related papers (2025-10-22T06:42:01Z) - Tool Learning in the Wild: Empowering Language Models as Automatic Tool Agents [56.822238860147024]
Augmenting large language models with external tools has emerged as a promising approach to extend their utility. Previous methods manually parse tool documentation and create in-context demonstrations, transforming tools into structured formats for LLMs to use in their step-by-step reasoning. We propose AutoTools, a framework that enables LLMs to automate the tool-use workflow.
arXiv Detail & Related papers (2024-05-26T11:40:58Z) - Towards Completeness-Oriented Tool Retrieval for Large Language Models [60.733557487886635]
Real-world systems often incorporate a wide array of tools, making it impractical to input all tools into Large Language Models.
Existing tool retrieval methods primarily focus on semantic matching between user queries and tool descriptions.
We propose COLT, a novel model-agnostic COllaborative Learning-based Tool Retrieval approach that captures not only the semantic similarities between user queries and tool descriptions but also the collaborative information of tools.
arXiv Detail & Related papers (2024-05-25T06:41:23Z) - EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction [56.02100384015907]
EasyTool is a framework transforming diverse and lengthy tool documentation into a unified and concise tool instruction.
It can significantly reduce token consumption and improve the performance of tool utilization in real-world scenarios.
arXiv Detail & Related papers (2024-01-11T15:45:11Z) - ControlLLM: Augment Language Models with Tools by Searching on Graphs [97.62758830255002]
We present ControlLLM, a novel framework that enables large language models (LLMs) to utilize multi-modal tools for solving real-world tasks.
Our framework comprises three key components: (1) a task decomposer that breaks down a complex task into clear subtasks with well-defined inputs and outputs; (2) a Thoughts-on-Graph (ToG) paradigm that searches for the optimal solution path on a pre-built tool graph; and (3) an execution engine with a rich toolbox that interprets the solution path and runs the tools.
arXiv Detail & Related papers (2023-10-26T21:57:21Z) - Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models [90.96816639172464]
Large language models (LLMs) are taught to use new tools by providing a few demonstrations of the tool's usage.
We advocate the use of tool documentation, descriptions for the individual tool usage, over demonstrations.
arXiv Detail & Related papers (2023-08-01T17:21:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.