Fugu-MT Paper Translation (Abstract): DivLogicEval: A Framework for Benchmarking Logical Reasoning Evaluation in Large Language Models

Paper summary: DivLogicEval: A Framework for Benchmarking Logical Reasoning Evaluation in Large Language Models

arxiv url: http://arxiv.org/abs/2509.15587v2
Date: Tue, 23 Sep 2025 14:48:18 GMT
Status: Translation complete
System update date: 2025-09-24 14:02:59.902477
Title: DivLogicEval: A Framework for Benchmarking Logical Reasoning Evaluation in Large Language Models
Title (Japanese reference translation): DivLogicEval: 大規模言語モデルにおける論理推論評価のベンチマークフレームワーク
Authors: Tsz Ting Chung, Lemao Liu, Mo Yu, Dit-Yan Yeung
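Abstract summary: This paper proposes DivLogicEval, a classical logic benchmark consisting of natural sentences composed of diverse statements. To make evaluation more reliable, it also introduces a new evaluation metric that mitigates the influence of bias and randomness inherent in large language models.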
Reference score (independently computed attention score): 58.439517684779936
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Logic reasoning in natural language has been recognized as an important measure of human intelligence for Large Language Models (LLMs). Popular benchmarks may entangle multiple reasoning skills and thus provide unfaithful evaluations on the logic reasoning skill. Meanwhile, existing logic reasoning benchmarks are limited in language diversity and their distributions are deviated from the distribution of an ideal logic reasoning benchmark, which may lead to biased evaluation results. This paper thereby proposes a new classical logic benchmark DivLogicEval, consisting of natural sentences composed of diverse statements in a counterintuitive way. To ensure a more reliable evaluation, we also introduce a new evaluation metric that mitigates the influence of bias and randomness inherent in LLMs. Through experiments, we demonstrate the extent to which logical reasoning is required to answer the questions in DivLogicEval and compare the performance of different popular LLMs in conducting logical reasoning.
Abstract (Japanese reference translation): 自然言語における論理的推論は、Large Language Models (LLMs) における人間の知能の重要な尺度として認識されている。人気のあるベンチマークは、複数の推論スキルを絡めて、ロジック推論スキルに関する不誠実な評価を提供する。一方、既存の論理推論ベンチマークは言語の多様性に制限があり、それらの分布は理想的な論理推論ベンチマークの分布から逸脱し、バイアス評価結果につながる可能性がある。そこで本稿では,多様な文からなる自然文からなる古典論理ベンチマークDivLogicEvalを提案する。より信頼性の高い評価を実現するため,LLMに固有のバイアスやランダム性の影響を緩和する新たな評価指標も導入した。実験を通して,DivLogicEvalの質問に対して論理的推論がどの程度必要かを示し,論理的推論を行う上で,様々なLLMの性能を比較した。

Related papers