Fugu-MT Paper Translation (Abstract): GRACE: Step-Level Benchmark for Faithful Reasoning over Context

Paper overview: GRACE: Step-Level Benchmark for Faithful Reasoning over Context

arxiv url: http://arxiv.org/abs/2606.16151v1
Date: Mon, 15 Jun 2026 03:11:23 GMT
Status: Translation complete
Last updated in system: 2026-06-16 16:21:34.046988
Title: GRACE: Step-Level Benchmark for Faithful Reasoning over Context
Authors: Hoang Pham, Dong Le, Anh Tuan Luu
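Abstract summary: Chain-of-Thought prompting produces traces that appear transparent, yet individual steps can silently deviate from the source evidence. GRACE is the first human-annotated step-level faithfulness benchmark with a data-driven error taxonomy. The taxonomy, discovered bottom-up via unsupervised clustering, organizes failures into two tracks.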
Reference score (independently computed attention score): 43.250340595492275
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Many reasoning tasks require models to reason over input context, from document-grounded question answering to rule-based deduction. Chain-of-Thought (CoT) prompting produces traces that appear transparent, yet individual steps can silently deviate from the source evidence, even when the final answer is correct. Existing methods detect hallucinations at the response level but fail to identify where in the chain a failure occurs or what type it is. We introduce GRACE, the first human-annotated step-level faithfulness benchmark with a data-driven error taxonomy for context-grounded textual reasoning. GRACE covers CoT traces from 10 models across 4 source datasets, with each step annotated for faithfulness, error category, and natural language explanation. A data-driven taxonomy, discovered bottom-up via unsupervised clustering, organizes failures into two tracks: GRACE-Inference (deductive errors) and GRACE-Grounding (factual grounding errors), with four categories each. The evaluation set is human-annotated and challenging by design. Our experiments reveal substantial headroom for current models. In addition, integrating step-level faithfulness signals into reinforcement learning pipelines improves both downstream accuracy and reasoning reliability.
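The annotation scheme described in the abstract (each step labeled for faithfulness, error category, and a natural language explanation, with errors split into two tracks) can be pictured as a small record type. The following is a minimal sketch, not the benchmark's actual data format: the class and field names (`StepAnnotation`, `faithful`, `track`, `category`, `explanation`) and both helper functions are assumptions made for illustration. The abstract does not name the four categories in each track, so `category` is left as a free-form string.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Track(Enum):
    """The two error tracks named in the abstract."""
    INFERENCE = "GRACE-Inference"  # deductive errors
    GROUNDING = "GRACE-Grounding"  # factual grounding errors


@dataclass
class StepAnnotation:
    """Hypothetical record for one annotated reasoning step."""
    text: str                        # the reasoning step itself
    faithful: bool                   # step-level faithfulness label
    track: Optional[Track] = None    # set only when the step is unfaithful
    category: Optional[str] = None   # one of four per track; names not given in the abstract
    explanation: str = ""            # natural language explanation


def first_unfaithful_step(steps: List[StepAnnotation]) -> Optional[int]:
    """Return the index of the first unfaithful step, or None if all steps are faithful."""
    for index, step in enumerate(steps):
        if not step.faithful:
            return index
    return None


def response_is_faithful(steps: List[StepAnnotation]) -> bool:
    """Collapse step labels into one response-level label (loses the error location)."""
    return all(step.faithful for step in steps)
```

Comparing the two helpers illustrates the gap the abstract points to: a response-level label only says that something went wrong, while step-level labels also say where in the chain the failure occurs and which track it belongs to.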

Related papers