Fugu-MT Paper Translation (Abstract): EduGuardBench: A Holistic Benchmark for Evaluating the Pedagogical Fidelity and Adversarial Safety of LLMs as Simulated Teachers

Paper abstract: EduGuardBench: A Holistic Benchmark for Evaluating the Pedagogical Fidelity and Adversarial Safety of LLMs as Simulated Teachers

arxiv url: http://arxiv.org/abs/2511.06890v1
Date: Mon, 10 Nov 2025 09:42:24 GMT
Status: Translation complete
Last updated in system: 2025-11-11 21:18:45.18486
Title: EduGuardBench: A Holistic Benchmark for Evaluating the Pedagogical Fidelity and Adversarial Safety of LLMs as Simulated Teachers
Title (reference translation): EduGuardBench: A holistic benchmark for evaluating the pedagogical fidelity and adversarial safety of LLMs as simulated teachers
Authors: Yilin Jiang, Mingzi Zhang, Xuanyu Yin, Sheng Jin, Suyu Lu, Zuocan Ying, Zengyi Yu, Xiangjie Kong
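Abstract summary: Large Language Models for Simulating Professions (SP-LLMs) are pivotal for personalized education. EduGuardBench assesses professional fidelity using a Role-playing Fidelity Score (RFS). It also probes safety vulnerabilities using persona-based adversarial prompts targeting general harms and, in particular, academic misconduct.
Reference score (independently calculated attention score): 8.123835490773095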
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models for Simulating Professions (SP-LLMs), particularly as teachers, are pivotal for personalized education. However, ensuring their professional competence and ethical safety is a critical challenge, as existing benchmarks fail to measure role-playing fidelity or address the unique teaching harms inherent in educational scenarios. To address this, we propose EduGuardBench, a dual-component benchmark. It assesses professional fidelity using a Role-playing Fidelity Score (RFS) while diagnosing harms specific to the teaching profession. It also probes safety vulnerabilities using persona-based adversarial prompts targeting both general harms and, particularly, academic misconduct, evaluated with metrics including Attack Success Rate (ASR) and a three-tier Refusal Quality assessment. Our extensive experiments on 14 leading models reveal a stark polarization in performance. While reasoning-oriented models generally show superior fidelity, incompetence remains the dominant failure mode across most models. The adversarial tests uncovered a counterintuitive scaling paradox, where mid-sized models can be the most vulnerable, challenging monotonic safety assumptions. Critically, we identified a powerful Educational Transformation Effect: the safest models excel at converting harmful requests into teachable moments by providing ideal Educational Refusals. This capacity is strongly negatively correlated with ASR, revealing a new dimension of advanced AI safety. EduGuardBench thus provides a reproducible framework that moves beyond siloed knowledge tests toward a holistic assessment of professional, ethical, and pedagogical alignment, uncovering complex dynamics essential for deploying trustworthy AI in education. See https://github.com/YL1N/EduGuardBench for Materials.
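Abstract (reference translation): Large language models for simulating professions (SP-LLMs), particularly as teachers, are pivotal for personalized education. However, because existing benchmarks neither measure role-playing fidelity nor address the teaching harms unique to educational scenarios, ensuring professional competence and ethical safety remains a critical challenge. To address this, we propose EduGuardBench, a dual-component benchmark. It assesses professional fidelity using a Role-playing Fidelity Score (RFS) while diagnosing harms specific to the teaching profession. It also probes safety vulnerabilities using persona-based adversarial prompts targeting both general harms and, in particular, academic misconduct, evaluated with metrics including Attack Success Rate (ASR) and a three-tier Refusal Quality assessment. Extensive experiments on 14 leading models reveal a stark polarization in performance. Reasoning-oriented models generally show superior fidelity, but incompetence remains the dominant failure mode across most models. The adversarial tests uncovered a counterintuitive scaling paradox in which mid-sized models can be the most vulnerable, challenging monotonic safety assumptions. The safest models excel at converting harmful requests into teachable moments by providing ideal Educational Refusals. This capacity is strongly negatively correlated with ASR, revealing a new dimension of advanced AI safety. EduGuardBench thus provides a reproducible framework that moves beyond siloed knowledge tests toward a holistic assessment of professional, ethical, and pedagogical alignment. See https://github.com/YL1N/EduGuardBench for materials.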

Related papers