Fugu-MT paper translation (abstract): QMFOL: Benchmarking Large Language Model Reasoning via Quantifiable Monadic First-Order Logic Test Case Generation

Paper overview: QMFOL: Benchmarking Large Language Model Reasoning via Quantifiable Monadic First-Order Logic Test Case Generation

arxiv url: http://arxiv.org/abs/2606.20227v1
Date: Thu, 18 Jun 2026 13:40:27 GMT
Status: translation complete
Last updated in system: 2026-06-19 18:23:39.882985
Title: QMFOL: Benchmarking Large Language Model Reasoning via Quantifiable Monadic First-Order Logic Test Case Generation
Title (reference translation): QMFOL: Benchmarking large language model reasoning via quantifiable monadic first-order logic test case generation
Authors: Xinyi Zheng, Ling Shi, Tianlong Yu, Yongxin Zhao, Lorenz Goette, Kailong Wang
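Abstract summary: This paper proposes QMFOL, a framework for generating monadic first-order logic reasoning tasks. It constructs formal logical structures using conjunction and disjunction patterns, enabling precise control over reasoning depth, width, label types, and distractors. QMFOLBench is a benchmark comprising 2880 instances with 960 configurations across diverse logical and semantic dimensions.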
Reference score (independently computed attention score): 7.42425368511977
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) have made significant progress in reasoning, particularly in deductive reasoning, which is crucial for high-stakes decision-making. As models improve, evaluation benchmarks should evolve to keep pace. However, existing benchmarks lack fine-grained control over logical complexity and struggle to balance semantic diversity with logical consistency. To address these issues, we propose QMFOL, an automated framework for generating monadic first-order logic reasoning tasks with quantifiable and controllable complexity. It constructs formal logical structures using conjunction and disjunction patterns, enabling precise control over reasoning depth, width, label types, and distractors. These structures are then translated into natural language via LLMs, with logical consistency ensured through round-trip verification using an external prover. Based on our framework, we build QMFOLBench, a benchmark comprising 2880 instances with 960 configurations across diverse logical and semantic dimensions. Evaluations on six large reasoning models (LRMs) and two LLMs show that performance degrades and computational overhead increases with rising logical complexity. Models perform better on True-labeled tasks than on False or Unknown ones, and exhibit sensitivity to semantic variation. Overall, QMFOL offers a scalable and reliable approach for constructing deductive reasoning benchmarks with controllable complexity, enabling more precise evaluation of reasoning capabilities in modern language models.
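Abstract (reference translation): Large language models (LLMs) have made significant progress in reasoning, particularly deductive reasoning, which is essential for high-stakes decision-making. As models improve, evaluation benchmarks should evolve to keep pace. However, existing benchmarks lack fine-grained control over logical complexity and do not balance semantic diversity with logical consistency. To address these problems, we propose QMFOL, an automated framework that generates monadic first-order logic reasoning tasks with quantifiable and controllable complexity. It constructs formal logical structures using conjunction and disjunction patterns, enabling precise control over reasoning depth, width, label types, and distractors. These structures are translated into natural language via LLMs, and logical consistency is ensured by round-trip verification with an external prover. Based on this framework, we build QMFOLBench, a benchmark comprising 2880 instances with 960 configurations across diverse logical and semantic dimensions. Evaluations of six large reasoning models (LRMs) and two LLMs show that performance degrades and computational overhead increases as logical complexity rises. Models perform better on True-labeled tasks than on False or Unknown ones, and are sensitive to semantic variation. Overall, QMFOL provides a scalable and reliable approach to building deductive reasoning benchmarks with controllable complexity, enabling more precise evaluation of reasoning capabilities in modern language models.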
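Illustrative sketch (not the authors' implementation): the abstract describes generating monadic first-order logic tasks with controllable depth, distractors, and True/False/Unknown labels. The toy Python below shows one way such a generator and its label check could look. It is simplified to single-premise implication chains over one individual, so it does not model the conjunction and disjunction patterns, the width parameter, or the natural-language translation and prover round trip. All names (`make_chain`, `solve`, the `P`/`D`/`E`/`U` predicates) are hypothetical.

```python
import random


def make_chain(depth, n_distractors, label, seed=0):
    """Build a toy monadic task about a single individual `a`.

    A rule (P, Q) stands for "for all x, P(x) -> Q(x)".
    A query (P, positive) asks about P(a) if positive, else about not P(a).
    """
    rng = random.Random(seed)
    preds = [f"P{i}" for i in range(depth + 1)]
    # The reasoning chain P0 -> P1 -> ... -> P_depth.
    rules = [(preds[i], preds[i + 1]) for i in range(depth)]
    facts = {preds[0]}
    # Distractor rules use fresh predicates that never touch the chain.
    for j in range(n_distractors):
        rules.append((f"D{j}", f"E{j}"))
    rng.shuffle(rules)
    goal = preds[-1]
    if label == "True":
        query = (goal, True)    # P_depth(a) is derivable
    elif label == "False":
        query = (goal, False)   # not P_depth(a) contradicts the chain
    else:
        query = ("U0", True)    # U0 appears nowhere, so it is undecidable
    return facts, rules, query


def solve(facts, rules, query):
    """Label a query by forward chaining under open-world semantics."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            if premise in known and conclusion not in known:
                known.add(conclusion)
                changed = True
    pred, positive = query
    if pred in known:
        return "True" if positive else "False"
    # Only positive facts and rules exist, so nothing can be refuted.
    return "Unknown"


if __name__ == "__main__":
    for label in ("True", "False", "Unknown"):
        facts, rules, query = make_chain(depth=4, n_distractors=3, label=label)
        print(label, solve(facts, rules, query))
```

In this sketch, increasing `depth` lengthens the derivation needed for a True or False label, while `n_distractors` adds rules that are irrelevant to the answer; the actual complexity measures used by QMFOL may differ.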


Related papers