ToolHaystack: Stress-Testing Tool-Augmented Language Models in Realistic Long-Term Interactions
- URL: http://arxiv.org/abs/2505.23662v1
- Date: Thu, 29 May 2025 17:10:12 GMT
- Title: ToolHaystack: Stress-Testing Tool-Augmented Language Models in Realistic Long-Term Interactions
- Authors: Beong-woo Kwak, Minju Kim, Dongha Lim, Hyungjoo Chae, Dongjin Kang, Sunghwan Kim, Dongil Yang, Jinyoung Yeo
- Abstract summary: We introduce ToolHaystack, a benchmark for testing tool-use capabilities in long-term interactions. Each test instance includes multiple task execution contexts and realistic noise within a continuous conversation. We find that while current models perform well in standard multi-turn settings, they often struggle significantly in ToolHaystack.
- Score: 9.825432101000358
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have demonstrated strong capabilities in using external tools to address user inquiries. However, most existing evaluations assume tool use in short contexts, offering limited insight into model behavior during realistic long-term interactions. To fill this gap, we introduce ToolHaystack, a benchmark for testing tool-use capabilities in long-term interactions. Each test instance in ToolHaystack includes multiple task execution contexts and realistic noise within a continuous conversation, enabling assessment of how well models maintain context and handle various disruptions. By applying this benchmark to 14 state-of-the-art LLMs, we find that while current models perform well in standard multi-turn settings, they often struggle significantly in ToolHaystack, highlighting critical gaps in their long-term robustness not revealed by previous tool benchmarks.
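To make the setup concrete, here is a minimal sketch of what one ToolHaystack-style test instance could look like as a data structure; the field names and layout are illustrative assumptions, not the benchmark's actual schema.

```python
# Hypothetical shape of a long-term, noisy tool-use test instance.
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    tool_name: str
    arguments: dict


@dataclass
class Turn:
    role: str                       # "user", "assistant", or "tool"
    content: str
    tool_call: ToolCall | None = None
    is_noise: bool = False          # distractor turn unrelated to any task


@dataclass
class TestInstance:
    # One continuous conversation interleaving several task execution
    # contexts with realistic noise turns.
    conversation: list[Turn] = field(default_factory=list)
    # Gold tool calls the model must still produce correctly at the end,
    # despite the long, noisy history.
    expected_calls: list[ToolCall] = field(default_factory=list)
```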
Related papers
- ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools [9.788417605537965]
We introduce ToolVQA, a large-scale multimodal dataset comprising 23K instances, designed to bridge the gap in real-world tool-use proficiency. ToolVQA features real-world visual contexts and challenging implicit multi-step reasoning tasks, better aligning with real user interactions. To construct this dataset, we propose ToolEngine, a novel data generation pipeline that employs Depth-First Search (DFS) with a dynamic in-context example matching mechanism.
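As a rough sketch of the kind of DFS-driven generation loop described, assuming a simple tool interface (the `Tool` class and `rank_tools_by_examples` helper below are hypothetical stand-ins, not ToolEngine's real API):

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    name: str
    run: Callable[[str], str]            # state in -> new state out
    similarity: Callable[[str], float]   # how well past examples match a state


def rank_tools_by_examples(state: str, tools: list) -> list:
    """Stand-in for dynamic in-context example matching: order candidate
    tools by how similar their past usage examples are to the current state."""
    return sorted(tools, key=lambda t: t.similarity(state), reverse=True)


def dfs_generate(state, tools, goal_reached, chain=(), max_depth=4):
    """Depth-first search over tool-call chains; collects every chain
    that reaches the goal within the depth budget."""
    if goal_reached(state):
        return [list(chain)]
    if len(chain) >= max_depth:
        return []
    trajectories = []
    for tool in rank_tools_by_examples(state, tools):
        new_state = tool.run(state)
        trajectories += dfs_generate(
            new_state, tools, goal_reached,
            chain + ((tool.name, new_state),), max_depth,
        )
    return trajectories
```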
arXiv Detail & Related papers (2025-08-05T10:06:16Z)
- Rethinking Stateful Tool Use in Multi-Turn Dialogues: Benchmarks and Challenges [30.68589269821412]
Existing benchmarks that assess Language Models (LMs) as Language Agents (LAs) for tool use primarily focus on single-turn interactions. We propose DialogTool, a multi-turn dialogue dataset with stateful tool interactions that covers the whole life cycle of tool use.
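A toy illustration of what "stateful" means here: the effect of an earlier tool call must persist and constrain later calls across the dialogue. The tools and state layout below are invented for illustration.

```python
# Minimal stateful environment: checkout only succeeds if the agent
# remembered to add items earlier in a possibly long dialogue.
class StatefulToolEnv:
    def __init__(self):
        self.state = {"cart": []}

    def add_to_cart(self, item: str) -> str:
        self.state["cart"].append(item)
        return f"added {item}"

    def checkout(self) -> str:
        if not self.state["cart"]:
            # A correct agent must track what happened in earlier turns.
            return "error: cart is empty"
        return f"purchased {len(self.state['cart'])} item(s)"


env = StatefulToolEnv()
env.add_to_cart("laptop")     # say, turn 3 of the dialogue
print(env.checkout())         # many turns later: purchased 1 item(s)
```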
arXiv Detail & Related papers (2025-05-19T16:36:13Z)
- Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger [49.81945268343162]
We propose MeCo, an adaptive decision-making strategy for external tool use. MeCo captures high-level cognitive signals in the representation space, guiding when to invoke tools. Our experiments show that MeCo accurately detects LLMs' internal cognitive signals and significantly improves tool-use decision-making.
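One plausible reading of such a meta-cognition trigger is a lightweight probe over the model's hidden representation that gates tool invocation; the probe, features, and threshold below are assumptions, not MeCo's actual setup.

```python
import numpy as np


class ToolUseTrigger:
    """Hypothetical linear probe on hidden states that decides tool use."""

    def __init__(self, probe_weights: np.ndarray, threshold: float = 0.5):
        self.w = probe_weights          # learned probe direction
        self.threshold = threshold

    def should_call_tool(self, hidden_state: np.ndarray) -> bool:
        # Logistic score over the final-layer representation of the query.
        score = 1.0 / (1.0 + np.exp(-float(self.w @ hidden_state)))
        return score > self.threshold


# Usage: extract the query's final hidden state from the LLM, then gate
# the tool call on the probe's verdict (random vectors stand in here).
trigger = ToolUseTrigger(probe_weights=np.random.randn(4096))
h = np.random.randn(4096)
if trigger.should_call_tool(h):
    pass  # route the query through the tool pipeline
```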
arXiv Detail & Related papers (2025-02-18T15:45:01Z)
- FREYR: A Framework for Recognizing and Executing Your Requests [2.4797200957733576]
This paper introduces FREYR, a streamlined framework that modularizes the tool usage process into separate steps. We show that FREYR achieves superior performance compared to conventional tool usage methods. We evaluate FREYR on a set of real-world test cases specific to video game design and compare it against traditional tool usage as provided by the Ollama API.
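A hedged sketch of a two-stage, modular pipeline in this spirit, with recognition and execution as separate steps; the step boundaries and prompts below are guesses based only on the summary above.

```python
# Stage 1: recognize which tool (if any) the request needs.
# Stage 2: extract parameters with a focused prompt, then execute.
def recognize(llm, request: str, tools: dict):
    names = ", ".join(tools)
    answer = llm(
        f"Request: {request}\nWhich of [{names}] applies? "
        "Reply with one name or 'none'."
    ).strip()
    return answer if answer in tools else None


def execute(llm, request: str, tool) -> str:
    args = llm(f"Extract JSON arguments for {tool.__name__} from: {request}")
    return tool(args)
```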
arXiv Detail & Related papers (2025-01-21T11:08:18Z)
- ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities [30.030101957186595]
ToolSandbox is an evaluation framework for large language models (LLMs). It includes stateful tool execution, implicit state dependencies between tools, and a built-in user simulator supporting on-policy conversational evaluation. We show that open-source and proprietary models have a significant performance gap, and that complex tasks like State Dependency, Canonicalization, and Insufficient Information defined in ToolSandbox challenge even the most capable SOTA LLMs.
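The "implicit state dependency" idea can be illustrated with a toy sandbox in which one tool silently requires the effect of another; the tools below are invented, not ToolSandbox's actual ones.

```python
# Nothing in send_email's signature says wifi must already be on;
# the agent has to infer the dependency from failures or world knowledge.
class Sandbox:
    def __init__(self):
        self.wifi_on = False

    def enable_wifi(self) -> str:
        self.wifi_on = True
        return "wifi enabled"

    def send_email(self, body: str) -> str:
        if not self.wifi_on:
            return "error: no network connection"
        return "email sent"


sb = Sandbox()
print(sb.send_email("hi"))   # error: no network connection
sb.enable_wifi()
print(sb.send_email("hi"))   # email sent
```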
arXiv Detail & Related papers (2024-08-08T05:45:42Z)
- Enhancing Tool Retrieval with Iterative Feedback from Large Language Models [9.588592185027455]
Large language models (LLMs) can effectively handle a certain number of tools through in-context learning or fine-tuning.
In real-world scenarios, the number of tools is typically extensive and irregularly updated, emphasizing the necessity for a dedicated tool retrieval component.
We propose to enhance tool retrieval with iterative feedback from the large language model.
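A minimal sketch of such a retrieve-critique-refine loop, assuming generic `retriever` and `llm` callables (all names are illustrative):

```python
def retrieve_with_feedback(query, retriever, llm, rounds=3, k=5):
    """Retrieve tools, ask the LLM to critique the candidates against the
    query, and fold the critique back into the next retrieval round."""
    refined_query = query
    candidates = retriever.search(refined_query, top_k=k)
    for _ in range(rounds):
        feedback = llm(
            f"Query: {query}\nCandidate tools: {candidates}\n"
            "Explain what is missing or irrelevant in these candidates."
        )
        # Use the critique to reformulate the retrieval query.
        refined_query = llm(
            f"Rewrite the query '{query}' so retrieval addresses this "
            f"feedback: {feedback}"
        )
        candidates = retriever.search(refined_query, top_k=k)
    return candidates
```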
arXiv Detail & Related papers (2024-06-25T11:12:01Z)
- Towards Completeness-Oriented Tool Retrieval for Large Language Models [60.733557487886635]
Real-world systems often incorporate a wide array of tools, making it impractical to input all tools into Large Language Models.
Existing tool retrieval methods primarily focus on semantic matching between user queries and tool descriptions.
We propose a novel model-agnostic COllaborative Learning-based Tool Retrieval approach, COLT, which not only captures the semantic similarities between user queries and tool descriptions but also takes into account the collaborative information of tools.
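A toy version of mixing the two signals, with co-usage counts standing in for the collaborative information; the paper's actual collaborative learning setup is richer than this weighted sum.

```python
import numpy as np


def score_tools(query_emb, tool_embs, co_usage, selected, alpha=0.7):
    """query_emb: (d,); tool_embs: (n, d); co_usage: (n, n) counts of how
    often two tools were used together; selected: indices already chosen."""
    semantic = tool_embs @ query_emb                 # similarity to the query
    collaborative = (
        co_usage[:, selected].sum(axis=1)            # affinity to chosen tools
        if selected else np.zeros(len(tool_embs))
    )

    def norm(x):
        # Normalize each signal so neither dominates the mix.
        rng = x.max() - x.min()
        return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)

    return alpha * norm(semantic) + (1 - alpha) * norm(collaborative)
```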
arXiv Detail & Related papers (2024-05-25T06:41:23Z)
- What Are Tools Anyway? A Survey from the Language Model Perspective [67.18843218893416]
Language models (LMs) are powerful yet largely confined to text generation tasks.
We provide a unified definition of tools as external programs used by LMs.
We empirically study the efficiency of various tooling methods.
arXiv Detail & Related papers (2024-03-18T17:20:07Z)
- TOOLVERIFIER: Generalization to New Tools via Self-Verification [69.85190990517184]
We introduce a self-verification method which distinguishes between close candidates by self-asking contrastive questions during tool selection.
Experiments on 4 tasks from the ToolBench benchmark, consisting of 17 unseen tools, demonstrate an average improvement of 22% over few-shot baselines.
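A sketch of tie-breaking via a contrastive question between the two closest candidate tools; the prompts and two-candidate restriction are simplifying assumptions.

```python
def select_tool(llm, user_request: str, top_two: list) -> str:
    """Ask a question that distinguishes the two candidates, answer it,
    and let the answer decide the tie."""
    a, b = top_two
    question = llm(
        f"Write one question whose answer determines whether '{a}' or "
        f"'{b}' is the right tool for: {user_request}"
    )
    answer = llm(f"{question}\nAnswer concisely for: {user_request}")
    verdict = llm(
        f"Given Q: {question}\nA: {answer}\n"
        f"Which tool fits better, '{a}' or '{b}'? Reply with the name."
    ).strip()
    return verdict if verdict in top_two else a   # fall back to the top hit
```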
arXiv Detail & Related papers (2024-02-21T22:41:38Z)
- Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios [93.68764280953624]
UltraTool is a novel benchmark designed to improve and evaluate Large Language Models' ability in tool utilization.
It emphasizes real-world complexities, demanding accurate, multi-step planning for effective problem-solving.
A key feature of UltraTool is its independent evaluation of planning with natural language, which happens before tool usage.
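A minimal sketch of scoring the natural-language plan on its own, before any tool runs, assuming generic `llm` and `judge` callables; the rubric and prompts are invented.

```python
def evaluate_planning(llm, judge, task: str) -> float:
    """Score the plan independently of any later tool execution."""
    plan = llm(
        f"Task: {task}\nWrite a numbered, natural-language plan. "
        "Do not call any tools yet."
    )
    score = judge(
        f"Task: {task}\nPlan:\n{plan}\n"
        "Rate the plan's correctness from 0 to 1."
    )
    return float(score)
```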
arXiv Detail & Related papers (2024-01-30T16:52:56Z)
- ART: Automatic multi-step reasoning and tool-use for large language models [105.57550426609396]
Large language models (LLMs) can perform complex reasoning in few- and zero-shot settings.
Each reasoning step can rely on external tools to support computation beyond the core LLM capabilities.
We introduce Automatic Reasoning and Tool-use (ART), a framework that uses frozen LLMs to automatically generate intermediate reasoning steps as a program.
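A rough sketch of the interleaving this implies: generation pauses at a tool-call line, the tool runs, and its output is spliced back before the frozen LLM resumes. The stop token and line format are assumptions, not ART's actual grammar.

```python
def run_art_loop(llm, tools, task: str, max_steps: int = 8) -> str:
    """Frozen LLM writes a program line by line; tool-call lines such as
    'CALL calculator: 12 * 7' are executed and their results spliced in."""
    program = f"Task: {task}\n"
    for _ in range(max_steps):
        step = llm(program, stop=["\n"])   # generate one program line
        program += step + "\n"
        if step.startswith("CALL "):
            name, _, args = step[len("CALL "):].partition(": ")
            result = tools[name](args)         # run the external tool
            program += f"RESULT: {result}\n"   # feed output back in
        elif step.startswith("ANSWER"):
            return step
    return program
```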
arXiv Detail & Related papers (2023-03-16T01:04:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.