Fugu-MT Paper Translation (Abstract): Language Models Do Not Follow Occam's Razor: A Benchmark for Inductive and Abductive Reasoning

Paper overview: Language Models Do Not Follow Occam's Razor: A Benchmark for Inductive and Abductive Reasoning

arxiv url: http://arxiv.org/abs/2509.03345v1
Date: Wed, 03 Sep 2025 14:22:42 GMT
Status: Translation complete
Last updated in system: 2025-09-04 21:40:46.546778
Title: Language Models Do Not Follow Occam's Razor: A Benchmark for Inductive and Abductive Reasoning
Title (reference translation): Language Models Do Not Follow Occam's Razor: A Benchmark for Inductive and Abductive Reasoning
Authors: Yunxin Sun, Abulhair Saparov
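Abstract summary: This work focuses on evaluating the inductive and abductive reasoning capabilities of large language models. It introduces InAbHyD, a programmable and synthetic dataset in which each reasoning example consists of an incomplete world model and a set of observations. The authors propose a new metric, based on Occam's Razor, for evaluating the quality of hypotheses.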
Reference score (independently computed attention measure): 6.06071622429429
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reasoning is a core capability in artificial intelligence systems, for which large language models (LLMs) have recently shown remarkable progress. However, most work focuses exclusively on deductive reasoning, which is problematic since other types of reasoning are also essential in solving real-world problems, and they are less explored. This work focuses on evaluating LLMs' inductive and abductive reasoning capabilities. We introduce a programmable and synthetic dataset, InAbHyD (pronounced in-a-bid), where each reasoning example consists of an incomplete world model and a set of observations. The task for the intelligent agent is to produce hypotheses to explain observations under the incomplete world model to solve each reasoning example. We propose a new metric to evaluate the quality of hypotheses based on Occam's Razor. We evaluate and analyze some state-of-the-art LLMs. Our analysis shows that LLMs can perform inductive and abductive reasoning in simple scenarios, but struggle with complex world models and producing high-quality hypotheses, even with popular reasoning-enhancing techniques such as in-context learning and RLVR.
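Abstract (reference translation): Reasoning is a core capability of artificial intelligence systems, and large language models (LLMs) have recently made remarkable progress in it. However, most work focuses exclusively on deductive reasoning, which is problematic because other types of reasoning are also essential for solving real-world problems and remain less explored. This work focuses on evaluating the inductive and abductive reasoning capabilities of LLMs. We introduce InAbHyD (pronounced in-a-bid), a programmable and synthetic dataset in which each reasoning example consists of an incomplete world model and a set of observations. The task for the intelligent agent is to produce hypotheses that explain the observations under the incomplete world model, thereby solving each reasoning example. We propose a new metric to evaluate the quality of hypotheses based on Occam's Razor. We evaluate and analyze state-of-the-art LLMs. The analysis shows that LLMs can perform inductive and abductive reasoning in simple scenarios, but struggle with complex world models and with producing high-quality hypotheses.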

Related papers