Fugu-MT Paper Translation (Summary): Constructing Evaluation Datasets for Procedural Reasoning: Balancing Naturalness, Grounding, and Multi-Hop Coverage

Paper summary: Constructing Evaluation Datasets for Procedural Reasoning: Balancing Naturalness, Grounding, and Multi-Hop Coverage

arxiv url: http://arxiv.org/abs/2606.12767v1
Date: Thu, 11 Jun 2026 00:17:57 GMT
Status: Translation complete
Last updated in system: 2026-06-12 15:55:27.512108
Title: Constructing Evaluation Datasets for Procedural Reasoning: Balancing Naturalness, Grounding, and Multi-Hop Coverage
Title (reference translation): Constructing evaluation datasets for procedural reasoning: balancing naturalness, grounding, and multi-hop coverage
Authors: Sarah Elshabrawy, Rahul K. Dass, Ashok K. Goel
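Abstract summary: This study examines how TMK-based question generation strategies affect dataset quality for procedural and multi-hop reasoning. Across 23 instructional topics and 690 question-answer pairs, strict TMK generation achieves the strongest overall quality. Transcript-first generation produces more learner-like questions but also more context-dependent or weakly grounded items, while TMK-aware generation yields high raw multi-hop coverage but lower grounding.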
Reference score (independently computed attention score): 1.873444918172383
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Evaluating procedural reasoning in AI-supported learning systems requires question-answer datasets that are both learner-like and grounded in the instructional knowledge the system is expected to use. We study how TMK-based question generation strategies affect dataset quality for procedural and multi-hop reasoning. We compare three strategies: strict generation from Task-Method-Knowledge (TMK) models, transcript-first generation with post-hoc TMK filtering, and TMK-aware generation that combines transcripts with structured guidance. To evaluate generated items, we introduce a grounding validation framework based on closed-set evidence units extracted from TMK models. The framework measures whether answers are supported by the underlying representation, whether questions are self-contained, and whether they target multi-hop procedural reasoning. Across 23 instructional topics and 690 generated question-answer pairs, strict TMK generation achieves the strongest overall quality, with 96.5% grounded questions and 92.6% usable questions. Transcript-first generation produces more learner-like questions but more context-dependent or weakly grounded items, while TMK-aware generation yields high raw multi-hop coverage but lower grounding. These results show that procedural richness and natural phrasing do not guarantee representational grounding, motivating explicit representation-aware validation for evaluation datasets in AI-supported learning.
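Abstract (reference translation): Evaluating procedural reasoning in AI-supported learning systems requires question-answer datasets that are learner-like and grounded in the instructional knowledge the system is expected to use. This study examines how TMK-based question generation strategies affect dataset quality for procedural and multi-hop reasoning. We compare three strategies: strict generation from Task-Method-Knowledge (TMK) models, transcript-first generation with post-hoc TMK filtering, and TMK-aware generation that combines transcripts with structured guidance. To evaluate the generated items, we propose a grounding validation framework based on closed-set evidence units extracted from TMK models. The framework measures whether answers are supported by the underlying representation, whether questions are self-contained, and whether they target multi-hop procedural reasoning. Across 23 instructional topics and 690 generated question-answer pairs, strict TMK generation achieves the highest quality, with 96.5% grounded questions and 92.6% usable questions. Transcript-first generation produces more learner-like questions but also more context-dependent or weakly grounded items, while TMK-aware generation yields high raw multi-hop coverage but lower grounding. These results suggest that procedural richness and natural phrasing do not guarantee representational grounding, which motivates explicit representation-aware validation for evaluation datasets in AI-supported learning.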
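
The abstract describes the grounding validation framework only at a high level, so the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes each generated item lists the evidence units its answer relies on, that "grounded" means every cited unit belongs to the closed set extracted from the TMK model, that "multi-hop" means two or more valid units are used, and that "usable" means grounded and self-contained. The `QAItem` schema, the function names, and the toy data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QAItem:
    """One generated question-answer pair (hypothetical schema)."""
    question: str
    answer: str
    evidence_ids: frozenset  # evidence units the answer claims to rely on
    self_contained: bool     # judged externally, e.g. by an annotator


def is_grounded(item, evidence_units):
    # Assumption: grounded = cites at least one unit, and every cited unit
    # belongs to the closed set extracted from the TMK model.
    return bool(item.evidence_ids) and item.evidence_ids <= evidence_units


def is_multi_hop(item, evidence_units):
    # Assumption: multi-hop = relies on two or more distinct valid units.
    return len(item.evidence_ids & evidence_units) >= 2


def summarize(items, evidence_units):
    """Return the fraction of items passing each check."""
    n = len(items)
    if n == 0:
        raise ValueError("no items to summarize")
    grounded = [is_grounded(i, evidence_units) for i in items]
    # Assumption: usable = grounded and self-contained.
    usable = [g and i.self_contained for g, i in zip(grounded, items)]
    multi_hop = [is_multi_hop(i, evidence_units) for i in items]
    return {
        "grounded": sum(grounded) / n,
        "usable": sum(usable) / n,
        "multi_hop": sum(multi_hop) / n,
    }


# Toy data, invented for illustration only (not from the paper).
units = frozenset({"task:boil", "method:heat", "knowledge:100C"})
items = [
    QAItem("q1", "a1", frozenset({"task:boil", "method:heat"}), True),
    QAItem("q2", "a2", frozenset({"method:heat"}), False),
    QAItem("q3", "a3", frozenset({"knowledge:unknown"}), True),
    QAItem("q4", "a4", frozenset(), True),
]
report = summarize(items, units)
```

Keeping the evidence units as a closed set makes the grounding check a plain subset test, which is one way to read the abstract's point that natural phrasing or procedural richness alone does not establish grounding.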

Related papers