Fugu-MT Paper Translation (Abstract): Elmes*: Automated Construction of Fine-Grained Evaluation Rubrics for Large Language Models in Long-Tail Educational Scenarios

Paper overview: Elmes*: Automated Construction of Fine-Grained Evaluation Rubrics for Large Language Models in Long-Tail Educational Scenarios

arxiv url: http://arxiv.org/abs/2606.06546v1
Date: Thu, 04 Jun 2026 07:40:12 GMT
Status: Translation complete
Last updated in system: 2026-06-08 14:33:29.362488
Title: Elmes*: Automated Construction of Fine-Grained Evaluation Rubrics for Large Language Models in Long-Tail Educational Scenarios
Title (reference translation): Elmes*: Automated Construction of Fine-Grained Evaluation Rubrics for Large Language Models in Long-Tail Educational Scenarios
Authors: Tao Liu, Ye Lu, Ruohua Zhang, Siyu Song, Wentao Liu, Aimin Zhou, Hao Hao,
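Abstract summary: Elmes* is an end-to-end framework for constructing, refining, and applying fine-grained scenario-specific rubrics. Edu-330 covers 330 scenarios across 11 subjects, 3 grade bands, and 10 task types, with over 1,000 second-level indicators.
Reference score (independently computed attention score): 22.62023107953559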
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Evaluating large language models (LLMs) for education requires measuring how models teach, not only what they know. Existing benchmarks emphasize domain-general correctness or depend on manually designed rubrics that scale poorly to long-tail pedagogical scenarios. We introduce Elmes*, an end-to-end framework for constructing, refining, and applying fine-grained scenario-specific rubrics. Elmes* combines a declarative multi-agent engine for teacher-student-judge interactions with SceneGen, a self-evolving module that co-optimizes evaluation criteria and test data from expert-defined pedagogical dimensions. Using Elmes*, we build Edu-330, covering 330 scenarios across 11 subjects, 3 grade bands, and 10 task types, with over 1,000 second-level indicators. Experiments on Edu-330 and four expert-authored gold-standard scenarios show that educational capability is multidimensional: top-tier LLMs differ mainly in creativity and values integration, knowledge-strong models may fail at Socratic scaffolding, and the education-specialized InnoSpark achieves the best human-evaluated average score. LLM judges preserve human-comparable rankings with much lower scoring variance, but exhibit judge-specific biases such as self-preference. Ablations show that expert-scored few-shot anchoring improves human-LLM alignment, while reasoning enforcement and greedy decoding are model-dependent. Elmes* thus provides scalable diagnostic infrastructure for pedagogically grounded LLM evaluation.
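Abstract (reference translation): Evaluating large language models (LLMs) for education requires measuring how models teach. Existing benchmarks emphasize domain-general correctness or depend on manually designed rubrics that scale poorly to long-tail pedagogical scenarios. Elmes* is an end-to-end framework for constructing, refining, and applying fine-grained scenario-specific rubrics. Elmes* combines a declarative multi-agent engine for teacher-student-judge interactions with SceneGen. Using Elmes*, we build Edu-330, covering 330 scenarios across 11 subjects, 3 grade bands, and 10 task types, with over 1,000 second-level indicators. Experiments on Edu-330 and four expert-authored gold-standard scenarios show that educational capability is multidimensional. LLM judges preserve human-comparable rankings with much lower scoring variance, but exhibit judge-specific biases such as self-preference. Expert-scored few-shot anchoring improves human-LLM alignment, while reasoning enforcement and greedy decoding are model-dependent. Elmes* provides scalable diagnostic infrastructure for pedagogically grounded LLM evaluation.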

