Fugu-MT Paper Translation (Abstract): Cognitive Load Limits in Large Language Models: Benchmarking Multi-Hop Reasoning

Paper overview: Cognitive Load Limits in Large Language Models: Benchmarking Multi-Hop Reasoning

arxiv url: http://arxiv.org/abs/2509.19517v2
Date: Thu, 25 Sep 2025 21:42:07 GMT
Status: Translation complete
Last updated in system: 2025-09-29 12:12:20.321695
Title: Cognitive Load Limits in Large Language Models: Benchmarking Multi-Hop Reasoning
Title (reference translation): Cognitive Load Limits in Large Language Models: Benchmarking Multi-Hop Reasoning
Authors: Sai Teja Reddy Adapala,
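Abstract summary: Large language models (LLMs) excel at isolated tasks, but their reasoning under cognitive load remains poorly understood. This paper introduces a formal theory of computational cognitive load, positing that task-irrelevant information (Context Saturation) and interference from task-switching are key mechanisms that degrade performance.
Reference score (independently computed attention metric): 0.0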
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The scaling of Large Language Models (LLMs) has exposed a critical gap between their performance on static benchmarks and their fragility in dynamic, information-rich environments. While models excel at isolated tasks, the computational limits that govern their reasoning under cognitive load remain poorly understood. In this work, we introduce a formal theory of computational cognitive load, positing that extraneous, task-irrelevant information (Context Saturation) and interference from task-switching (Attentional Residue) are key mechanisms that degrade performance. We designed the Interleaved Cognitive Evaluation (ICE), a deconfounded benchmark to systematically manipulate these load factors on challenging multi-hop reasoning tasks. A comprehensive study (N = 10 replications per item across 200 questions) revealed significant performance variations across five instruction-tuned models. Smaller open-source architectures (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.2) exhibited baseline brittleness, achieving 0% accuracy (SEM = 0.0) across all conditions, including clean controls, on this high-intrinsic-load task. In contrast, Gemini-2.0-Flash-001 showed partial resilience, achieving 85% accuracy in control conditions, with a statistically significant degradation under context saturation ($\beta = -0.003$ per % load, $p < 0.001$). These findings provide preliminary evidence that cognitive load is a key contributor to reasoning failures, supporting theories of hallucination-as-guessing under uncertainty. We conclude that dynamic, cognitive-aware stress testing, as exemplified by the ICE benchmark, is essential for evaluating the true resilience and safety of advanced AI systems.
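Abstract (reference translation): The scaling of Large Language Models (LLMs) has exposed a critical gap between their performance on static benchmarks and their fragility in dynamic, information-rich environments. Models excel at isolated tasks, but the computational limits that govern their reasoning under cognitive load are still poorly understood. This work introduces a formal theory of computational cognitive load, positing that task-irrelevant information (Context Saturation) and interference from task-switching (Attentional Residue) are key mechanisms that degrade performance. The authors designed the Interleaved Cognitive Evaluation (ICE), a deconfounded benchmark that systematically manipulates these load factors on challenging multi-hop reasoning tasks. A comprehensive study (10 replications per item across 200 questions) showed significant performance variation across five instruction-tuned models. Smaller open-source architectures (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.2) exhibited baseline brittleness, achieving 0% accuracy (SEM = 0.0) in all conditions, including clean controls. In contrast, Gemini-2.0-Flash-001 showed partial resilience, achieving 85% accuracy in control conditions, with a statistically significant degradation under context saturation ($\beta = -0.003$ per % load, $p < 0.001$). These findings are preliminary evidence that cognitive load is a key contributor to reasoning failures, supporting theories of hallucination-as-guessing under uncertainty. The authors conclude that dynamic, cognitive-aware stress testing, as exemplified by the ICE benchmark, is essential for evaluating the true resilience and safety of advanced AI systems.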
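
The statistics quoted in the abstract (accuracy, SEM, and a slope per % load) can be illustrated with a minimal Python sketch. This is not the paper's analysis code. It assumes the reported slope is read on a linear accuracy scale (the abstract does not say which regression model was fitted), and the load levels, helper names, and data points below are hypothetical, generated only to show the arithmetic.

```python
import math
from statistics import mean, stdev


def accuracy_and_sem(outcomes):
    """Return mean accuracy and standard error of the mean for 0/1 outcomes."""
    n = len(outcomes)
    acc = mean(outcomes)
    # Sample standard deviation divided by sqrt(n); zero when every outcome is identical.
    sem = stdev(outcomes) / math.sqrt(n) if n > 1 else 0.0
    return acc, sem


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical illustration: accuracies placed exactly on a line with
# intercept 0.85 and slope -0.003 per % load (the two figures quoted in the
# abstract). The load levels themselves are an assumption, not from the paper.
loads = [0, 20, 40, 60, 80]
accuracies = [0.85 - 0.003 * load for load in loads]
slope = ols_slope(loads, accuracies)

# Ten replications that are all incorrect give 0% accuracy with SEM = 0.0,
# because there is no variance between replications.
acc, sem = accuracy_and_sem([0] * 10)
```

Under the linear reading assumed here, each additional percentage point of load lowers expected accuracy by 0.3 percentage points. If the authors fitted a logistic model instead, the coefficient would be on the log-odds scale and this reading would not apply.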

Related papers