Fugu-MT Paper Translation (Abstract): Are Agents Ready to Teach? A Multi-Stage Benchmark for Real-World Teaching Workflows

Paper overview: Are Agents Ready to Teach? A Multi-Stage Benchmark for Real-World Teaching Workflows

arxiv url: http://arxiv.org/abs/2605.14322v2
Date: Wed, 20 May 2026 17:47:19 GMT
Status: Translation complete
Last updated in system: 2026-05-22 20:14:18.40717
Title: Are Agents Ready to Teach? A Multi-Stage Benchmark for Real-World Teaching Workflows
Title (reference translation): Are agents ready to teach? A multi-stage benchmark for real-world teaching workflows
Authors: Zixin Chen, Peng Liu, Rui Sheng, Haobo Li, Jianhong Tu, Xiaodong Deng, Kashun Shum, Dayiheng Liu, Huamin Qu
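Abstract summary: EduAgentBench is a source-grounded benchmark for evaluating tutor agents across the full scope of teaching work. It contains 150 quality-controlled tasks across three capability surfaces: professional pedagogical judgment, situated multi-turn tutoring, and Canvas-style teaching workflow completion. To our knowledge, EduAgentBench is the first theory-grounded and realistic benchmark for evaluating the holistic teaching capability of tutor agents.
Reference score (independently computed attention score): 48.61619205237941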
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Language agents are increasingly deployed in complex professional workflows, with tutoring emerging as a particularly high-stakes capability that remains largely unmeasured in existing benchmarks. Effective tutor agents require more than producing correct answers or executing accurate tool calls: a robust tutor must diagnose learner state, adapt support over time, make pedagogically justified decisions grounded in educational evidence, and execute interventions within realistic learning-management systems. We introduce EduAgentBench, a source-grounded benchmark for holistically evaluating tutor agents across the full scope of teaching work. It contains 150 quality-controlled tasks across three capability surfaces: professional pedagogical judgment, situated multi-turn tutoring, and Canvas-style teaching workflow completion. Tasks are constructed through a pedagogical-insight-driven pipeline and evaluated with complementary verification signals and human review. Across a comprehensive evaluation of frontier models, our findings reveal that current models are generally capable of bounded pedagogical judgment, but still fall short of professional teaching standards in situated tutoring and autonomous teaching-workflow execution. To our knowledge, EduAgentBench is the first theory-grounded and realistic benchmark for evaluating the holistic teaching capability of tutor agents, providing a measurement foundation for developing future tutor agents that can support realistic teaching work.
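Abstract (reference translation): Language agents are increasingly deployed in complex professional workflows, and tutoring has emerged as a particularly high-stakes capability that existing benchmarks have largely left unmeasured. A robust tutor must diagnose the learner's state, adapt its support over time, make pedagogically justified decisions grounded in educational evidence, and execute interventions within realistic learning-management systems. EduAgentBench is a source-grounded benchmark for holistically evaluating tutor agents across the full scope of teaching work. It contains 150 quality-controlled tasks across three capability surfaces: professional pedagogical judgment, situated multi-turn tutoring, and Canvas-style teaching workflow completion. Tasks are constructed through a pedagogical-insight-driven pipeline and evaluated with complementary verification signals and human review. A comprehensive evaluation of frontier models shows that current models are generally capable of bounded pedagogical judgment, but still fall short of professional teaching standards in situated tutoring and autonomous teaching-workflow execution. To our knowledge, EduAgentBench is the first theory-grounded and realistic benchmark for evaluating the holistic teaching capability of tutor agents, and it provides a measurement foundation for developing future tutor agents that can support realistic teaching work.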

