Cognitive Load Limits in Large Language Models: Benchmarking Multi-Hop Reasoning
- URL: http://arxiv.org/abs/2509.19517v2
- Date: Thu, 25 Sep 2025 21:42:07 GMT
- Title: Cognitive Load Limits in Large Language Models: Benchmarking Multi-Hop Reasoning
- Authors: Sai Teja Reddy Adapala
- Abstract summary: Large Language Models (LLMs) excel at isolated tasks, but their reasoning under cognitive load remains poorly understood. We introduce a formal theory of computational cognitive load, positing that extraneous, task-irrelevant information (Context Saturation) and interference from task-switching (Attentional Residue) are key mechanisms that degrade performance.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The scaling of Large Language Models (LLMs) has exposed a critical gap between their performance on static benchmarks and their fragility in dynamic, information-rich environments. While models excel at isolated tasks, the computational limits that govern their reasoning under cognitive load remain poorly understood. In this work, we introduce a formal theory of computational cognitive load, positing that extraneous, task-irrelevant information (Context Saturation) and interference from task-switching (Attentional Residue) are key mechanisms that degrade performance. We designed the Interleaved Cognitive Evaluation (ICE), a deconfounded benchmark to systematically manipulate these load factors on challenging multi-hop reasoning tasks. A comprehensive study (N = 10 replications per item across 200 questions) revealed significant performance variations across five instruction-tuned models. Smaller open-source architectures (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.2) exhibited baseline brittleness, achieving 0% accuracy (SEM = 0.0) across all conditions, including clean controls, on this high-intrinsic-load task. In contrast, Gemini-2.0-Flash-001 showed partial resilience, achieving 85% accuracy in control conditions, with a statistically significant degradation under context saturation ($\beta = -0.003$ per % load, $p < 0.001$). These findings provide preliminary evidence that cognitive load is a key contributor to reasoning failures, supporting theories of hallucination-as-guessing under uncertainty. We conclude that dynamic, cognitive-aware stress testing, as exemplified by the ICE benchmark, is essential for evaluating the true resilience and safety of advanced AI systems.
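The abstract reports a per-%-load accuracy slope ($\beta = -0.003$) from a regression of accuracy on context-saturation level. As a minimal sketch (not the authors' code), the slope of such a fit can be recovered with ordinary least squares; the load levels and accuracies below are hypothetical values chosen only to illustrate the procedure.

```python
def ols_slope(xs, ys):
    """Ordinary least squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Hypothetical accuracies measured at 0%, 25%, 50%, 75%, 100% context load.
loads = [0, 25, 50, 75, 100]
accs = [0.85, 0.78, 0.70, 0.63, 0.55]

beta = ols_slope(loads, accs)
print(f"slope: {beta:.4f} accuracy per % load")
```

In the paper's setting the fit would additionally account for repeated measurements per item (N = 10 replications across 200 questions), e.g. via a mixed-effects model; the sketch above shows only the slope estimate itself.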
Related papers
- Evaluating and Enhancing the Vulnerability Reasoning Capabilities of Large Language Models [15.849480549367684]
We propose DAGVul, a novel framework that models vulnerability reasoning as a Directed Acyclic Graph (DAG) generation task. By further introducing Reinforcement Learning with Verifiable Rewards (RLVR), we align the model's reasoning trace with program-intrinsic logic. Our framework improves the reasoning F1-score by an average of 18.9% over all baselines.
arXiv Detail & Related papers (2026-02-06T13:19:45Z) - Understanding Degradation with Vision Language Model [56.09241449206817]
Understanding visual degradations is a critical yet challenging problem in computer vision. We introduce DU-VLM, a multimodal chain-of-thought model trained with supervised fine-tuning and reinforcement learning. We also introduce DU-110k, a large-scale dataset comprising 110,000 clean-degraded pairs with grounded physical annotations.
arXiv Detail & Related papers (2026-02-04T13:51:15Z) - Representation-Aware Unlearning via Activation Signatures: From Suppression to Knowledge-Signature Erasure [2.0017902634527194]
We introduce the Knowledge Immunization Framework (KIF), a representation-aware architecture that distinguishes genuine erasure from obfuscation. Our approach combines dynamic suppression of subject-specific representations with parameter-efficient adaptation, enabling durable unlearning without full model retraining.
arXiv Detail & Related papers (2026-01-15T16:28:14Z) - Lost in the Noise: How Reasoning Models Fail with Contextual Distractors [57.31788955167306]
Recent advances in reasoning models and agentic AI systems have led to an increased reliance on diverse external information. We introduce NoisyBench, a comprehensive benchmark that systematically evaluates model robustness across 11 datasets in RAG, reasoning, alignment, and tool-use tasks. Our evaluation reveals a catastrophic performance drop of up to 80% in state-of-the-art models when faced with contextual distractors.
arXiv Detail & Related papers (2026-01-12T05:43:51Z) - Encyclo-K: Evaluating LLMs with Dynamically Composed Knowledge Statements [78.87065404966002]
Existing benchmarks predominantly curate questions at the question level. We propose Encyclo-K, a statement-based benchmark that rethinks benchmark construction from the ground up.
arXiv Detail & Related papers (2025-12-31T13:55:54Z) - The Drill-Down and Fabricate Test (DDFT): A Protocol for Measuring Epistemic Robustness in Language Models [0.0]
Current language model evaluations measure what models know under ideal conditions but not how robustly they know it under realistic stress. We introduce the Drill-Down and Fabricate Test (DDFT), a protocol that measures this robustness. We find flagship models exhibit brittleness despite their scale, while smaller models can achieve robust performance.
arXiv Detail & Related papers (2025-12-29T20:29:09Z) - Hierarchical Evaluation of Software Design Capabilities of Large Language Models of Code [7.897548449569687]
Large language models (LLMs) are increasingly adopted in the software engineering domain, yet the robustness of their grasp on core design concepts remains unclear. We generate poorly designed software fragments under various levels of guidance. Reasoning about coupling proves brittle; performance collapses in noisy, open-ended scenarios. Reasoning-trace analysis confirms these failure modes, revealing cognitive shortcutting for coupling versus a more exhaustive (yet still failing) analysis for cohesion.
arXiv Detail & Related papers (2025-11-25T23:50:00Z) - SG-OIF: A Stability-Guided Online Influence Framework for Reliable Vision Data [6.4391040754741296]
In this paper, we introduce a Stability-Guided Online Influence Framework (SG-OIF) for approximating training-point influence on test predictions. We show that SG-OIF achieves 91.1% accuracy in the top 1% of prediction samples on CIFAR-10, and a 99.8% AUPR score on MNIST.
arXiv Detail & Related papers (2025-11-21T19:58:54Z) - OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks [52.87238755666243]
We present OmniEAR, a framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks. We model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains. Our systematic evaluation reveals severe performance degradation when models must reason from constraints.
arXiv Detail & Related papers (2025-08-07T17:54:15Z) - It Only Gets Worse: Revisiting DL-Based Vulnerability Detectors from a Practical Perspective [14.271145160443462]
VulTegra compares scratch-trained and pre-trained DL models for vulnerability detection. State-of-the-art (SOTA) detectors still suffer from low consistency, limited real-world capabilities, and scalability challenges.
arXiv Detail & Related papers (2025-07-13T08:02:56Z) - SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning [95.28059121743831]
Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for training large language models (LLMs) on complex reasoning tasks. We introduce a Self-aware Weakness-driven problem Synthesis framework (SwS) that systematically identifies model deficiencies and leverages them for problem augmentation. SwS enables robust generalization by empowering the model to self-identify and address its weaknesses in RL, yielding average performance gains of 10.0% and 7.7% on 7B and 32B models.
arXiv Detail & Related papers (2025-06-10T17:02:00Z) - Enhancing LLM Reliability via Explicit Knowledge Boundary Modeling [48.15636223774418]
Large language models (LLMs) are prone to hallucination stemming from misaligned self-awareness. We propose the Explicit Knowledge Boundary Modeling framework to integrate fast and slow reasoning systems, harmonizing reliability and usability.
arXiv Detail & Related papers (2025-03-04T03:16:02Z) - AskToAct: Enhancing LLMs Tool Use via Self-Correcting Clarification [25.27444694706659]
We present AskToAct, which exploits the structural mapping between queries and their tool invocation solutions. By systematically removing key parameters from queries while retaining them as ground truth, we enable automated construction of high-quality training data. Our framework exhibits robust performance across different model architectures and successfully generalizes to entirely unseen APIs without additional training.
arXiv Detail & Related papers (2025-03-03T12:55:49Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present MR-Ben, a process-based benchmark that demands meta-reasoning skill. Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.