Fugu-MT paper translation (abstract): Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy

Paper summary: Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy

arxiv url: http://arxiv.org/abs/2604.02709v1
Date: Fri, 03 Apr 2026 04:06:39 GMT
Status: Translation complete
Last updated in system: 2026-04-06 17:20:24.316197
Title: Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy
Title (reference translation): Evaluating the Formal Reasoning Capabilities of Large Language Models through the Chomsky Hierarchy
Authors: Yihong Dong, Xiaoha Jian, Xue Jiang, Xuyuan Guo, Zhiyuan Fan, Jiaru Qian, Kechi Zhang, Jia Li, Zhi Jin, Ge Li
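Abstract summary: It is unknown whether SOTA LLMs can grasp the structured, hierarchical complexity of formal languages. ChomskyBench is a benchmark for systematically evaluating LLMs through the lens of the Chomsky Hierarchy. ChomskyBench consists of a comprehensive suite of language recognition and generation tasks designed to test capabilities at each level.
Reference score (attention score computed independently by the site): 62.32144504442516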
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The formal reasoning capabilities of LLMs are crucial for advancing automated software engineering. However, existing benchmarks for LLMs lack systematic evaluation based on computation and complexity, leaving a critical gap in understanding their formal reasoning capabilities. Therefore, it is still unknown whether SOTA LLMs can grasp the structured, hierarchical complexity of formal languages as defined by Computation Theory. To address this, we introduce ChomskyBench, a benchmark for systematically evaluating LLMs through the lens of Chomsky Hierarchy. Unlike prior work that uses vectorized classification for neural networks, ChomskyBench is the first to combine full Chomsky Hierarchy coverage, process-trace evaluation via natural language, and deterministic symbolic verifiability. ChomskyBench is composed of a comprehensive suite of language recognition and generation tasks designed to test capabilities at each level. Extensive experiments indicate a clear performance stratification that correlates with the hierarchy's levels of complexity. Our analysis reveals a direct relationship where increasing task difficulty substantially impacts both inference length and performance. Furthermore, we find that while larger models and advanced inference methods offer notable relative gains, they face severe efficiency barriers: achieving practical reliability would require prohibitive computational costs, revealing that current limitations stem from inefficiency rather than absolute capability bounds. A time complexity analysis further indicates that LLMs are significantly less efficient than traditional algorithmic programs for these formal tasks. These results delineate the practical limits of current LLMs, highlight the indispensability of traditional software tools, and provide insights to guide the development of future LLMs with more powerful formal reasoning capabilities.
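Abstract (reference translation): The formal reasoning capabilities of LLMs are essential to progress in automated software engineering. However, existing LLM benchmarks lack systematic evaluation grounded in computation and complexity, leaving an important gap in understanding their formal reasoning capabilities. It is therefore unknown whether SOTA LLMs can grasp the structured, hierarchical complexity of formal languages as defined by computation theory. To address this, we introduce ChomskyBench, a benchmark that systematically evaluates LLMs through the lens of the Chomsky hierarchy. Unlike earlier work that used vectorized classification for neural networks, ChomskyBench is the first to combine full coverage of the Chomsky hierarchy, process-trace evaluation via natural language, and deterministic symbolic verifiability. ChomskyBench consists of a comprehensive suite of language recognition and generation tasks designed to test capabilities at each level. Extensive experiments show a clear stratification of performance that correlates with the hierarchy's levels of complexity. The analysis reveals a direct relationship in which increasing task difficulty substantially affects both inference length and performance. Furthermore, larger models and advanced inference methods bring notable relative gains, but they face severe efficiency barriers: achieving practical reliability would require prohibitive computational costs, which indicates that current limitations stem from inefficiency rather than absolute capability bounds. A time complexity analysis further shows that LLMs are significantly less efficient than traditional algorithmic programs on these formal tasks. These results delineate the practical limits of current LLMs, highlight the indispensability of traditional software tools, and provide insights to guide the development of future LLMs with stronger formal reasoning capabilities.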
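
To make "language recognition" and "deterministic symbolic verifiability" concrete, here is a minimal sketch of membership checks at three levels of the Chomsky hierarchy. The languages chosen ((ab)*, a^n b^n, a^n b^n c^n) are standard textbook examples, and the function names and the `verify` helper are hypothetical; none of them are taken from ChomskyBench itself.

```python
import re

# Textbook example languages, assumed for illustration only;
# these are not the actual ChomskyBench tasks.

def is_regular(s: str) -> bool:
    """Type 3 (regular): the language (ab)*."""
    return re.fullmatch(r"(ab)*", s) is not None

def is_context_free(s: str) -> bool:
    """Type 2 (context-free, not regular): a^n b^n for n >= 0.

    Checked by comparing against the canonical string of the same length.
    """
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "a" * n + "b" * n

def is_context_sensitive(s: str) -> bool:
    """Type 1 (context-sensitive, not context-free): a^n b^n c^n for n >= 0."""
    n = len(s) // 3
    return len(s) % 3 == 0 and s == "a" * n + "b" * n + "c" * n

def verify(model_answer: bool, recognizer, s: str) -> bool:
    """Hypothetical checker: compare a model's yes/no answer with the recognizer."""
    return model_answer == recognizer(s)
```

Each check runs in time linear in the input length, which gives a sense of the kind of algorithmic baseline against which the abstract's efficiency comparison is made.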

Related papers