Fugu-MT paper translation (abstract): Criterion-referenceability determines LLM-as-a-judge validity across physics assessment formats

Paper abstract: Criterion-referenceability determines LLM-as-a-judge validity across physics assessment formats

arxiv url: http://arxiv.org/abs/2603.14732v1
Date: Mon, 16 Mar 2026 02:09:06 GMT
Status: translation complete
Last updated in system: 2026-03-17 16:19:35.995897
Title: Criterion-referenceability determines LLM-as-a-judge validity across physics assessment formats
Title (reference translation): Determination of LLM-as-a-judge validity by criterion-referenceability across physics assessment formats
Authors: Will Yeadon, Tom Hardy, Paul Mackay, Elise Agra,
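Abstract summary: We compare GPT-5.2, Grok 4.1, Claude Opus 4.5, DeepSeek-V3.2, Gemini Pro 3, and committee aggregations against human markers under blind, solution-provided, false-solution, and exemplar-anchored conditions. For $n=771$ blind university exam questions, models achieve fractional mean absolute errors (fMAE) $\approx 0.22$ with robust discriminative validity. Across $n=55$ scripts, blind AI marking is harsher and more variable than human marking, with discriminative validity already poor.
Reference score (independently computed attention measure): 0.01116979912801043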
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As large language models (LLMs) are increasingly considered for automated assessment and feedback, understanding when LLM marking can be trusted is essential. We evaluate LLM-as-a-judge marking across three physics assessment formats - structured questions, written essays, and scientific plots - comparing GPT-5.2, Grok 4.1, Claude Opus 4.5, DeepSeek-V3.2, Gemini Pro 3, and committee aggregations against human markers under blind, solution-provided, false-solution, and exemplar-anchored conditions. For $n=771$ blind university exam questions, models achieve fractional mean absolute errors (fMAE) $\approx 0.22$ with robust discriminative validity (Spearman $ρ> 0.6$). For secondary and university structured questions ($n=1151$), providing official solutions reduces MAE and strengthens validity (committee $ρ= 0.88$); false solutions degrade absolute accuracy but leave rank ordering largely intact (committee $ρ= 0.77$; individual models $ρ\geq 0.59$). Essay marking behaves fundamentally differently. Across $n=55$ scripts ($n=275$ essays), blind AI marking is harsher and more variable than human marking, with discriminative validity already poor ($ρ\approx 0.1$). Adding a mark scheme does not improve discrimination ($ρ\approx 0$; all confidence intervals include zero). Anchored exemplars shift the AI mean close to the human mean and compress variance below the human standard deviation, but discriminative validity remains near-zero - distributional agreement can occur without valid discrimination. For code-based plot elements ($n=1400$), models achieve exceptionally high discriminative validity ($ρ> 0.84$) with near-linear calibration. Across all task types, validity tracks criterion-referenceability - the extent to which a task maps to explicit, observable grading features - and benchmark reliability, rather than raw model capability.
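Abstract (reference translation): As large language models (LLMs) are increasingly considered for automated assessment and feedback, it is essential to understand when LLM marking can be trusted. We compare GPT-5.2, Grok 4.1, Claude Opus 4.5, DeepSeek-V3.2, Gemini Pro 3, and committee aggregations against human markers under blind, solution-provided, false-solution, and exemplar-anchored conditions. For $n=771$ blind university exam questions, models achieve fractional mean absolute errors (fMAE) $\approx 0.22$ with robust discriminative validity (Spearman $ρ > 0.6$). For secondary and university structured questions ($n=1151$), providing official solutions reduces MAE and strengthens validity (committee $ρ = 0.88$). Essay marking behaves fundamentally differently. Across $n=55$ scripts ($n=275$ essays), blind AI marking is harsher and more variable than human marking, and discriminative validity is already poor ($ρ \approx 0.1$). Adding a mark scheme does not improve discrimination ($ρ \approx 0$; all confidence intervals include zero). Anchored exemplars shift the AI mean close to the human mean and compress variance below the human standard deviation, but discriminative validity remains near zero: distributional agreement can occur without valid discrimination. For code-based plot elements ($n=1400$), models achieve exceptionally high discriminative validity ($ρ > 0.84$) with near-linear calibration. Across all task types, validity tracks criterion-referenceability (the extent to which a task maps to explicit, observable grading features) and benchmark reliability, rather than raw model capability.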
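
The abstract's two headline metrics can be illustrated with a minimal sketch. The abstract does not define fMAE, so the version below assumes it is the mean of |AI mark - human mark| divided by the marks available for each item; that definition, the function names, and the toy marks are all assumptions made for illustration and are not taken from the paper. Spearman's ρ is computed in the standard way, as the Pearson correlation of average ranks.

```python
from statistics import mean, pstdev


def fractional_mae(ai_marks, human_marks, max_marks):
    """Mean of |AI - human| / available marks (ASSUMED definition of fMAE)."""
    return mean(abs(a - h) / m for a, h, m in zip(ai_marks, human_marks, max_marks))


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Invented marks out of 10, for illustration only.
human = [2, 4, 6, 8]
ai_ranked = [1, 3, 5, 7]    # one mark harsher throughout, same ordering
ai_matched = [6, 2, 8, 4]   # same mean and spread as human, scrambled ordering
max_marks = [10, 10, 10, 10]

for name, ai in [("ranked", ai_ranked), ("matched", ai_matched)]:
    print(name,
          "fMAE:", round(fractional_mae(ai, human, max_marks), 3),
          "rho:", round(spearman_rho(ai, human), 3),
          "mean:", mean(ai), "sd:", round(pstdev(ai), 3))
```

In this toy data, `ai_ranked` is systematically harsh yet orders the scripts exactly as the human marker does (ρ = 1), while `ai_matched` has the same mean and standard deviation as the human marks but ρ = 0. The second case is a small-scale picture of the abstract's point that distributional agreement can occur without valid discrimination.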
