Fugu-MT Paper Translation (Abstract): HOLMES: Evaluating Higher-Order Logical Reasoning in LLMs

Paper overview: HOLMES: Evaluating Higher-Order Logical Reasoning in LLMs

arxiv url: http://arxiv.org/abs/2606.23238v1
Date: Mon, 22 Jun 2026 12:23:54 GMT
Status: Translation complete
System update date: 2026-06-24 16:10:15.182041
Title: HOLMES: Evaluating Higher-Order Logical Reasoning in LLMs
Title (reference translation): HOLMES: Evaluating Higher-Order Logical Reasoning in LLMs
Authors: Yucheng Wu, Jundong Xu, Mingzhen Ju, Yue Yu, Chenpeng Wang, Haoxuan Li, Liangming Pan
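Abstract summary: HOLMES (Higher-Order Logic Meets real-world Explainable Symbolic reasoning) is the first real-world benchmark for higher-order symbolic reasoning in LLMs. Built on higher-order logic, HOLMES pairs natural-language problems with HOL formalizations, ground-truth answers, verifiable reasoning traces, and fine-grained controllable reasoning factors across law and finance. Experiments show that current LLMs struggle on HOLMES, with an average accuracy of 50.64% and the best model reaching 59.54%.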
Reference score (independently computed attention score): 37.82259837085897
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Logical reasoning is essential for reliable AI, yet existing benchmarks are largely first-order-logic-centric, focusing on object-level deduction over fixed predicates. This misses many realistic scenarios where models must reason over rules, predicates, functions, constraints, and decision procedures themselves. We introduce HOLMES (Higher-Order Logic Meets real-world Explainable Symbolic reasoning), the first real-world benchmark for higher-order symbolic reasoning in LLMs, containing 1379 instances. Built on higher-order logic, HOLMES pairs natural-language problems with HOL formalizations, ground-truth answers, verifiable reasoning traces, and fine-grained controllable reasoning factors across law and finance. Experiments show that current LLMs still struggle on HOLMES, with an average accuracy of only 50.64% and the best model reaching 59.54%. Our analyses further reveal that high final-answer accuracy can mask shortcut reasoning in conflict-resolution settings, while performance drops sharply under scope-conditioned and compositional reasoning. These findings identify higher-order symbolic reasoning as a key bottleneck for building reliable and verifiable LLMs. The project code and dataset are publicly available at https://github.com/wuyucheng2002/HOLMES.
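Abstract (reference translation): Logical reasoning is essential for reliable AI, but existing benchmarks are mostly centered on first-order logic and focus on object-level deduction over fixed predicates. This overlooks many realistic scenarios in which models must reason about the rules, predicates, functions, constraints, and decision procedures themselves. HOLMES (Higher-Order Logic Meets real-world Explainable Symbolic reasoning) is the first real-world benchmark for higher-order symbolic reasoning in LLMs and contains 1379 instances. Built on higher-order logic, HOLMES pairs natural-language problems with HOL formalizations, ground-truth answers, verifiable reasoning traces, and fine-grained controllable reasoning factors across law and finance. Experiments show that current LLMs still struggle on HOLMES, with an average accuracy of only 50.64% and the best model reaching 59.54%. Further analysis shows that high final-answer accuracy can mask shortcut reasoning in conflict-resolution settings, while performance drops sharply under scope-conditioned and compositional reasoning. These findings identify higher-order symbolic reasoning as a key bottleneck for building reliable and verifiable LLMs. The project code and dataset are publicly available at https://github.com/wuyucheng2002/HOLMES.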

