Fugu-MT Paper Translation (Abstract): GEAR: A General Evaluation Framework for Abductive Reasoning

Paper overview: GEAR: A General Evaluation Framework for Abductive Reasoning

arxiv url: http://arxiv.org/abs/2509.24096v1
Date: Sun, 28 Sep 2025 22:22:28 GMT
Status: Translation complete
Last updated in system: 2025-09-30 22:32:19.629624
Title: GEAR: A General Evaluation Framework for Abductive Reasoning
Title (reference translation): GEAR: A General Evaluation Framework for Abductive Reasoning
Authors: Kaiyu He, Peilin Wu, Mian Zhang, Kun Wan, Wentian Zhao, Xinya Du, Zhiyu Chen
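Abstract summary: GEAR (General Evaluation for Abductive Reasoning) is a general-purpose, fully automated, transparent, and label-free evaluation paradigm. GEAR scores hypothesis sets by three metrics: consistency (each hypothesis explains the observations), generalizability (consistent hypotheses make meaningful predictions on unseen inputs), and diversity (the set covers distinct predictions and patterns).
Reference score (independently computed attention score): 21.08814504507274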
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Since the advent of large language models (LLMs), research has focused on instruction following and deductive reasoning. A central question remains: can these models discover new knowledge, and how can we evaluate this ability? We address this by studying abductive reasoning, the generation of plausible hypotheses to explain observations, and introduce GEAR (General Evaluation for Abductive Reasoning), a general-purpose, fully automated, transparent, and label-free evaluation paradigm. GEAR scores hypothesis sets by three metrics: consistency (each hypothesis explains the observations), generalizability (consistent hypotheses make meaningful predictions on unseen inputs), and diversity (the set covers distinct predictions and patterns). Built this way, GEAR is scalable (no human gold answers), reliable (deterministic scoring aligned with classical abduction), and open-ended (scores improve only when models produce new plausible hypotheses, unlike static benchmarks that saturate once accuracy is high). Using GEAR, we conduct a fine-grained study of nine LLMs on four abduction benchmarks with 1,500 problems, generating over 50,000 candidate hypotheses and revealing model differences obscured by gold-answer or purely human evaluations. We further propose a momentum-based curriculum that adjusts GEAR-derived training data by learning velocity: it starts with what the model learns quickly and shifts toward harder objectives such as generating diverse hypotheses once the model is confident on foundational objectives. Without gold-label supervision, this strategy improves all GEAR objectives and these gains transfer to established abductive reasoning benchmarks. Taken together, GEAR provides a principled framework that evaluates abduction and supplies label-free, scalable training signals that help LLMs produce more diverse and reliable hypotheses.
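The three metrics named in the abstract can be illustrated with a toy scorer. This is only a minimal sketch under stated assumptions, not the paper's implementation: the abstract does not give formal definitions, so here a hypothesis is assumed to be a Python function from an integer input to an integer output (or None when it makes no prediction), consistency is taken as the fraction of hypotheses that reproduce every observation, generalizability as the average share of unseen inputs on which a consistent hypothesis returns a prediction, and diversity as the number of distinct prediction patterns over the unseen inputs. The names is_consistent, prediction_rate, and score_hypothesis_set are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Assumption: a hypothesis maps an input to a prediction, or None if it has none.
Hypothesis = Callable[[int], Optional[int]]
Observation = Tuple[int, int]


def is_consistent(h: Hypothesis, observations: List[Observation]) -> bool:
    """A hypothesis is consistent if it reproduces every observed (input, output) pair."""
    return all(h(x) == y for x, y in observations)


def prediction_rate(h: Hypothesis, unseen: List[int]) -> float:
    """Share of unseen inputs on which the hypothesis commits to a prediction."""
    return sum(h(x) is not None for x in unseen) / len(unseen)


def score_hypothesis_set(
    hypotheses: List[Hypothesis],
    observations: List[Observation],
    unseen: List[int],
) -> Dict[str, float]:
    """Score a set of hypotheses without any gold answer (toy definitions)."""
    consistent = [h for h in hypotheses if is_consistent(h, observations)]
    # Two hypotheses count as the same pattern if they agree on all unseen inputs.
    patterns = {tuple(h(x) for x in unseen) for h in consistent}
    rates = [prediction_rate(h, unseen) for h in consistent]
    return {
        "consistency": len(consistent) / len(hypotheses),
        "generalizability": sum(rates) / len(rates) if rates else 0.0,
        "diversity": float(len(patterns)),
    }


# Toy problem: explain the observations 1 -> 2 and 2 -> 4.
observations = [(1, 2), (2, 4)]
unseen = [3, 4]
lookup = {1: 2, 2: 4}
hypotheses = [
    lambda x: 2 * x,           # doubling
    lambda x: x + x,           # doubling again, phrased differently
    lambda x: x * x - x + 2,   # a quadratic that also fits both observations
    lambda x: lookup.get(x),   # memorizes the observations, predicts nothing new
    lambda x: x + 1,           # fails the second observation
]
scores = score_hypothesis_set(hypotheses, observations, unseen)
print(scores)
```

In this toy run, four of the five hypotheses are consistent (0.8), the memorizing hypothesis pulls the average prediction rate down to 0.75, and the two doubling hypotheses collapse into one pattern, leaving three distinct patterns. No gold hypothesis is needed at any point, which is the property the abstract describes as label-free.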
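Abstract (reference translation): Since the advent of large language models (LLMs), research has focused on instruction following and deductive reasoning. Can these models discover new knowledge, and how can this ability be evaluated? This paper studies abductive reasoning (the generation of plausible hypotheses) and introduces GEAR (General Evaluation for Abductive Reasoning), a general-purpose, fully automated, transparent, and label-free evaluation paradigm. GEAR scores hypothesis sets by three metrics: consistency (each hypothesis explains the observations), generalizability (consistent hypotheses make meaningful predictions on unseen inputs), and diversity (the set covers distinct predictions and patterns). Built this way, GEAR is scalable (no human gold answers), reliable (deterministic scoring aligned with classical abduction), and open-ended (scores improve only when models produce new plausible hypotheses). Using GEAR, the paper conducts a fine-grained study of nine LLMs on four abduction benchmarks with 1,500 problems, generating over 50,000 candidate hypotheses and revealing model differences obscured by gold-answer or purely human evaluations. It further proposes a momentum-based curriculum that adjusts GEAR-derived training data by learning velocity. Without gold-label supervision, this strategy improves all GEAR objectives, and these gains transfer to established abductive reasoning benchmarks. GEAR provides a principled framework that evaluates abduction and supplies label-free, scalable training signals that help LLMs produce more diverse and reliable hypotheses.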


Related papers