Fugu-MT Paper Translation (Abstract): Omanic: Towards Step-wise Evaluation of Multi-hop Reasoning in Large Language Models

Paper summary: Omanic: Towards Step-wise Evaluation of Multi-hop Reasoning in Large Language Models

arxiv url: http://arxiv.org/abs/2603.16654v1
Date: Tue, 17 Mar 2026 15:23:37 GMT
Status: Translation complete
System update date: 2026-03-18 17:42:07.369874
Title: Omanic: Towards Step-wise Evaluation of Multi-hop Reasoning in Large Language Models
Title (reference translation): Omanic: Toward Step-wise Evaluation of Multi-hop Reasoning in Large Language Models
Authors: Xiaojie Gu, Sherry T. Tong, Aosong Feng, Sophia Simeng Han, Jinghui Lu, Yingjian Chen, Yusuke Iwasawa, Yutaka Matsuo, Chanjun Park, Rex Ying, Irene Li
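Abstract summary: Omanic is an open-domain multi-hop QA resource that provides decomposed sub-questions and intermediate answers as structural annotations for analyzing reasoning processes. It contains 10,296 machine-generated training examples (OmanicSynth) and 967 expert-reviewed, human-annotated evaluation examples (OmanicBench).
Reference score (independently computed attention score): 60.418191092851636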
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reasoning-focused large language models (LLMs) have advanced in many NLP tasks, yet their evaluation remains challenging: final answers alone do not expose the intermediate reasoning steps, making it difficult to determine whether a model truly reasons correctly and where failures occur, while existing multi-hop QA benchmarks lack step-level annotations for diagnosing reasoning failures. To address this gap, we propose Omanic, an open-domain multi-hop QA resource that provides decomposed sub-questions and intermediate answers as structural annotations for analyzing reasoning processes. It contains 10,296 machine-generated training examples (OmanicSynth) and 967 expert-reviewed human-annotated evaluation examples (OmanicBench). Systematic evaluations show that state-of-the-art LLMs achieve only 73.11% multiple-choice accuracy on OmanicBench, confirming its high difficulty. Stepwise analysis reveals that CoT's performance hinges on factual completeness, with its gains diminishing under knowledge gaps and errors amplifying in later hops. Additionally, supervised fine-tuning on OmanicSynth brings substantial transfer gains (7.41 average points) across six reasoning and math benchmarks, validating the dataset's quality and further supporting the effectiveness of OmanicSynth as supervision for reasoning-capability transfer. We release the data at https://huggingface.co/datasets/li-lab/Omanic and the code at https://github.com/XiaojieGu/Omanic.
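Abstract (reference translation): Reasoning-focused large language models (LLMs) have advanced in many NLP tasks, but evaluating them remains difficult. Final answers alone do not expose the intermediate reasoning steps, so it is hard to tell whether a model truly reasons correctly and where it fails, and existing multi-hop QA benchmarks lack step-level annotations for diagnosing reasoning failures. To address this gap, the authors propose Omanic, an open-domain multi-hop QA resource that provides decomposed sub-questions and intermediate answers as structural annotations for analyzing reasoning processes. It contains 10,296 machine-generated training examples (OmanicSynth) and 967 expert-reviewed, human-annotated evaluation examples (OmanicBench). Systematic evaluations show that state-of-the-art LLMs reach only 73.11% multiple-choice accuracy on OmanicBench, confirming its high difficulty. Stepwise analysis shows that CoT performance depends on factual completeness: its gains diminish when knowledge is missing, and errors amplify in later hops. In addition, supervised fine-tuning on OmanicSynth yields substantial transfer gains (7.41 points on average) across six reasoning and math benchmarks, which validates the dataset's quality and supports OmanicSynth as supervision for transferring reasoning capability. The data is released at https://huggingface.co/datasets/li-lab/Omanic and the code at https://github.com/XiaojieGu/Omanic.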
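
Illustration (not from the paper): the abstract describes step-wise analysis over decomposed sub-questions and intermediate answers. The sketch below shows one way per-hop accuracy could be computed alongside final-answer accuracy. The record layout (`hops`, `gold`, `pred`, `final_gold`, `final_pred`) and the exact-match scoring are assumptions made for this example; they are not the Omanic schema or the authors' evaluation code.

```python
from collections import defaultdict


def normalize(text):
    """Lowercase and trim whitespace before exact-match comparison."""
    return text.strip().lower()


def stepwise_accuracy(examples):
    """Return (per-hop accuracy dict, final-answer accuracy).

    Each example is a dict with hypothetical keys:
      "hops": ordered list of {"gold": str, "pred": str}, one per sub-question
      "final_gold", "final_pred": gold and predicted final answers
    """
    hop_correct = defaultdict(int)
    hop_total = defaultdict(int)
    final_correct = 0

    for ex in examples:
        # Score each intermediate answer at its hop position (1-based).
        for position, hop in enumerate(ex["hops"], start=1):
            hop_total[position] += 1
            if normalize(hop["pred"]) == normalize(hop["gold"]):
                hop_correct[position] += 1
        # Score the final answer separately from the intermediate hops.
        if normalize(ex["final_pred"]) == normalize(ex["final_gold"]):
            final_correct += 1

    per_hop = {p: hop_correct[p] / hop_total[p] for p in sorted(hop_total)}
    final = final_correct / len(examples) if examples else 0.0
    return per_hop, final


# Toy records invented for this illustration only.
toy_examples = [
    {
        "hops": [
            {"gold": "Paris", "pred": "paris"},
            {"gold": "France", "pred": "France"},
        ],
        "final_gold": "Euro",
        "final_pred": "Euro",
    },
    {
        "hops": [
            {"gold": "Tokyo", "pred": "Tokyo "},
            {"gold": "Japan", "pred": "China"},
        ],
        "final_gold": "Yen",
        "final_pred": "Yuan",
    },
]

per_hop_acc, final_acc = stepwise_accuracy(toy_examples)
```

Reporting accuracy by hop position, rather than only on the final answer, is what makes it possible to see whether errors concentrate in later hops.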

Related papers