Beyond Accuracy: A Cognitive Load Framework for Mapping the Capability Boundaries of Tool-use Agents
- URL: http://arxiv.org/abs/2601.20412v1
- Date: Wed, 28 Jan 2026 09:17:51 GMT
- Title: Beyond Accuracy: A Cognitive Load Framework for Mapping the Capability Boundaries of Tool-use Agents
- Authors: Qihao Wang, Yue Hu, Mingzhe Lu, Jiayue Wu, Yanbing Liu, Yuanmin Tang,
- Abstract summary: We introduce a framework grounded in Cognitive Load Theory to move from simple performance scoring to a diagnostic tool. Our framework deconstructs task complexity into two quantifiable components: Intrinsic Load and Extraneous Load. Our evaluation reveals distinct performance cliffs as cognitive load increases, allowing us to precisely map each model's capability boundary.
- Score: 11.65679508751598
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The ability of Large Language Models (LLMs) to use external tools unlocks powerful real-world interactions, making rigorous evaluation essential. However, current benchmarks primarily report final accuracy, revealing what models can do but obscuring the cognitive bottlenecks that define their true capability boundaries. To move from simple performance scoring to a diagnostic tool, we introduce a framework grounded in Cognitive Load Theory. Our framework deconstructs task complexity into two quantifiable components: Intrinsic Load, the inherent structural complexity of the solution path, formalized with a novel Tool Interaction Graph; and Extraneous Load, the difficulty arising from ambiguous task presentation. To enable controlled experiments, we construct ToolLoad-Bench, the first benchmark with parametrically adjustable cognitive load. Our evaluation reveals distinct performance cliffs as cognitive load increases, allowing us to precisely map each model's capability boundary. We validate that our framework's predictions are highly calibrated with empirical results, establishing a principled methodology for understanding an agent's limits and a practical foundation for building more efficient systems.
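The abstract formalizes Intrinsic Load with a Tool Interaction Graph: the solution path is a graph of tool calls whose structural complexity can be quantified. The paper does not specify its metric here, so the sketch below is a hypothetical illustration using two simple proxies (number of tools and longest dependency chain); the function name and graph encoding are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the Intrinsic Load idea: tools are nodes, data
# dependencies between calls are edges, and structural complexity is scored
# from the resulting DAG. Depth and size are used here as stand-in proxies.
from collections import defaultdict

def intrinsic_load(edges):
    """Toy intrinsic-load score for a DAG of tool calls.

    edges: list of (producer_tool, consumer_tool) dependencies.
    Returns (num_tools, longest_chain); deeper or larger graphs mean
    higher intrinsic load under this toy metric.
    """
    graph = defaultdict(list)
    nodes = set()
    for src, dst in edges:
        graph[src].append(dst)
        nodes.update((src, dst))

    memo = {}
    def depth(node):
        # Longest dependency chain starting at this tool (counted in nodes).
        if node not in memo:
            memo[node] = 1 + max((depth(c) for c in graph[node]), default=0)
        return memo[node]

    longest = max((depth(n) for n in nodes), default=0)
    return len(nodes), longest

# A three-step sequential pipeline: search -> fetch -> summarize.
tools, chain = intrinsic_load([("search", "fetch"), ("fetch", "summarize")])
```

Under this toy metric, a purely sequential three-tool pipeline scores (3, 3), while adding parallel branches raises the tool count without deepening the chain, which matches the intuition that structural depth and breadth load an agent differently.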
Related papers
- From Perception to Action: An Interactive Benchmark for Vision Reasoning [51.11355591375073]
The Causal Hierarchy of Actions and Interactions (CHAIN) benchmark is designed to evaluate whether models can understand, plan, and execute structured action sequences grounded in physical constraints. CHAIN shifts evaluation from passive perception to active problem solving, spanning tasks such as interlocking mechanical puzzles and 3D stacking and packing. Our results show that top-performing models still struggle to internalize physical structure and causal constraints, often failing to produce reliable long-horizon plans or to robustly translate perceived structure into effective actions.
arXiv Detail & Related papers (2026-02-24T15:33:02Z) - Towards Worst-Case Guarantees with Scale-Aware Interpretability [58.519943565092724]
Neural networks organize information according to the hierarchical, multi-scale structure of natural data. We propose a unifying research agenda, *scale-aware interpretability*, to develop formal machinery and interpretability tools.
arXiv Detail & Related papers (2026-02-05T01:22:31Z) - SEAL: Self-Evolving Agentic Learning for Conversational Question Answering over Knowledge Graphs [28.59157823781425]
SEAL is a novel two-stage semantic parsing framework grounded in self-evolving agentic learning. SEAL achieves state-of-the-art performance, especially in multi-hop reasoning, comparison, and aggregation tasks. The results validate notable gains in both structural accuracy and computational efficiency.
arXiv Detail & Related papers (2025-12-04T14:52:30Z) - Toward a unified framework for data-efficient evaluation of large language models [12.922829524961813]
LEGO-IRT is a unified and flexible framework for data-efficient evaluation of large language models. It supports both binary and continuous evaluation metrics. We show that LEGO-IRT achieves stable capability estimates using just 3% of the total evaluation items.
arXiv Detail & Related papers (2025-10-05T06:13:50Z) - Cognitive Load Limits in Large Language Models: Benchmarking Multi-Hop Reasoning [0.0]
Large Language Models (LLMs) excel at isolated tasks, but their reasoning under cognitive load remains poorly understood. We introduce a formal theory of computational cognitive load, positing that extraneous, task-irrelevant information (Context Saturation) and interference from task-switching are key mechanisms that degrade performance.
arXiv Detail & Related papers (2025-09-23T19:36:56Z) - Improving Large Language Models Function Calling and Interpretability via Guided-Structured Templates [56.73907811047611]
Large language models (LLMs) have demonstrated strong reasoning and tool-use capabilities. However, LLMs often fail in real-world tool interactions due to incorrect parameterization, poor tool selection, or misinterpretation of user intent. We introduce a curriculum-inspired framework that leverages structured reasoning templates to guide LLMs through more deliberate step-by-step instructions for generating function calls.
arXiv Detail & Related papers (2025-09-22T17:55:14Z) - Understanding Tool-Integrated Reasoning [9.235747697967984]
We study why Tool-Integrated Reasoning makes Large Language Models (LLMs) more capable. LLMs integrated with tools like Python code interpreters show great promise, but a principled theory explaining why this paradigm is effective has been missing. We demonstrate that tools enable a strict expansion of the model's empirical and feasible support, breaking the capability ceiling of pure-text models.
arXiv Detail & Related papers (2025-08-26T17:03:46Z) - Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments [70.42705564227548]
We propose an automated environment construction pipeline for large language models (LLMs). This enables the creation of high-quality training environments that provide detailed and measurable feedback without relying on external tools. We also introduce a verifiable reward mechanism that evaluates both the precision of tool use and the completeness of task execution.
arXiv Detail & Related papers (2025-08-12T09:45:19Z) - Computational Reasoning of Large Language Models [51.629694188014064]
We introduce Turing Machine Bench (TMBench), a benchmark to assess the ability of Large Language Models (LLMs) to execute reasoning processes. TMBench incorporates four key features: self-contained and knowledge-agnostic reasoning, a minimalistic multi-step structure, controllable difficulty, and a theoretical foundation based on Turing machines.
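The kind of self-contained, knowledge-agnostic, multi-step process TMBench is built on can be illustrated with a minimal Turing-machine simulator. The rule format and function below are illustrative assumptions, not TMBench's actual task specification.

```python
# Illustrative only: a minimal Turing-machine step simulator. Tracing such a
# machine requires purely mechanical multi-step reasoning with no outside
# knowledge, which is the property TMBench exploits.

def run_tm(rules, tape, state="q0", steps=100):
    """rules: {(state, symbol): (new_state, write_symbol, move)},
    with move in {-1, 0, +1} and "_" as the blank symbol.
    Halts on state "halt" or a missing rule; returns the final tape string."""
    cells = dict(enumerate(tape))
    head = 0
    for _ in range(steps):
        sym = cells.get(head, "_")
        if state == "halt" or (state, sym) not in rules:
            break
        state, cells[head], move = rules[(state, sym)]
        head += move
    lo, hi = min(cells), max(cells)
    return "".join(cells.get(i, "_") for i in range(lo, hi + 1))

# Unary increment: scan right over a run of 1s, append one more 1, halt.
rules = {
    ("q0", "1"): ("q0", "1", +1),   # keep moving right over the 1s
    ("q0", "_"): ("halt", "1", 0),  # write one extra 1 and halt
}
```

Difficulty is controllable in exactly the sense the entry describes: longer tapes or larger rule tables require more simulation steps, without changing the nature of the reasoning.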
arXiv Detail & Related papers (2025-04-29T13:52:47Z) - FamilyTool: A Multi-hop Personalized Tool Use Benchmark [93.80355496575281]
FamilyTool is a benchmark grounded in a family-based knowledge graph (KG) that simulates personalized, multi-hop tool use scenarios. Experiments reveal significant performance gaps in state-of-the-art Large Language Models (LLMs). FamilyTool serves as a critical resource for evaluating and advancing LLM agents' reasoning, adaptability, and scalability in complex, dynamic environments.
arXiv Detail & Related papers (2025-04-09T10:42:36Z) - Enhancing LLM Reliability via Explicit Knowledge Boundary Modeling [41.19330514054401]
Large language models (LLMs) are prone to hallucination stemming from misaligned self-awareness. We propose the Explicit Knowledge Boundary Modeling framework to integrate fast and slow reasoning systems to harmonize reliability and usability.
arXiv Detail & Related papers (2025-03-04T03:16:02Z) - A Trainable Optimal Transport Embedding for Feature Aggregation and its Relationship to Attention [96.77554122595578]
We introduce a parametrized representation of fixed size, which embeds and then aggregates elements from a given input set according to the optimal transport plan between the set and a trainable reference.
Our approach scales to large datasets and allows end-to-end training of the reference, while also providing a simple unsupervised learning mechanism with small computational cost.
arXiv Detail & Related papers (2020-06-22T08:35:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.