Fugu-MT Paper Translation (Abstract): Estimating Item Difficulty with Large Language Models as Experts

Paper abstract: Estimating Item Difficulty with Large Language Models as Experts

arxiv url: http://arxiv.org/abs/2605.18562v1
Date: Mon, 18 May 2026 15:42:13 GMT
Status: Translation complete
Last updated in system: 2026-05-19 17:57:49.920733
Title: Estimating Item Difficulty with Large Language Models as Experts
Authors: Diana Kolesnikova, Kirill Fedyanin, Abe D. Hofman, Matthieu J. S. Brinkhuis, Maria Bolsinova
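Abstract summary: This study evaluated three off-the-shelf LLMs as difficulty raters for newly created items without access to response data. It used a full factorial design crossing three factors: judgement format (absolute vs pairwise), decision type (hard decisions vs token-probability-based estimates), and prompting strategy. LLM-based estimates showed moderate to strong positive correlations with empirical item difficulties.
Reference score (independently computed attention score): 0.44101646956991475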
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Accurate estimates of item difficulty are essential for valid assessment and effective adaptive learning. However, for newly created tasks, response data are typically unavailable. Pretesting and expert judgement can be costly and slow, while machine learning methods often require large labelled training datasets. Recent work suggests that large language models (LLMs) may help. However, there is limited evidence on the elicitation procedures and prompt configurations used to emulate experts for difficulty estimation. This study addresses this gap by evaluating three off-the-shelf LLMs as difficulty raters for newly created items without access to response data. Using an item bank from an online learning system, the study examined 6 domains of primary-school mathematics, with empirical difficulty estimates treated as empirical reference. The study used a full factorial design crossing three factors: judgement format (absolute vs pairwise), decision type (hard decisions vs token-probability-based estimates), and prompting strategy (zero-shot vs few-shot). LLM-derived difficulty estimates were compared with empirical difficulties using Spearman rank correlations. Across domains, LLM-based estimates exhibited moderate to strong positive correlations with empirical item difficulties. For simpler arithmetic tasks, some configurations approached the upper end of the accuracy range reported for human experts in previous research. Pairwise comparison consistently outperformed absolute judgement in the absence of additional refinements. However, when token-level probabilities were incorporated and examples of items with known empirical difficulty were provided, the absolute judgement configuration likewise demonstrated moderate-to-high alignment. The study positions LLMs as a promising tool for initial item calibration and offers insights into effective workflow configuration.
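The pairwise-comparison format and the Spearman evaluation named in the abstract can be illustrated with a minimal sketch. The abstract does not say how pairwise judgements were aggregated into item scores; the win-rate aggregation below is an assumption chosen for simplicity, and `toy_judge`, `proxy`, and the numeric data are hypothetical stand-ins for an LLM call and an item bank.

```python
from itertools import combinations


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def win_rate_scores(n_items, harder_of):
    """Score each item by the share of its pairwise comparisons judged 'harder'.

    Assumption: win-rate aggregation; the paper may use a different method.
    """
    wins = [0] * n_items
    for i, j in combinations(range(n_items), 2):
        wins[harder_of(i, j)] += 1
    return [w / (n_items - 1) for w in wins]


# Hypothetical empirical difficulties for five items.
empirical = [-1.2, 0.3, -0.4, 1.5, 0.9]

# Hypothetical stand-in for the LLM: a made-up latent score per item.
proxy = [2, 5, 6, 9, 7]


def toy_judge(i, j):
    """Return the index of the item the stand-in judge considers harder."""
    return i if proxy[i] > proxy[j] else j


llm_scores = win_rate_scores(len(empirical), toy_judge)
rho = spearman(llm_scores, empirical)
print(llm_scores, round(rho, 3))
```

In this toy setup the judge swaps the order of two neighbouring items relative to the empirical ranking, so the correlation is high but not perfect. Comparing every pair costs a number of judgements that grows quadratically with the number of items, which is one practical trade-off against absolute ratings.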
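The contrast between hard decisions and token-probability-based estimates in absolute judgement can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: it assumes the model exposes log-probabilities for its next token, that ratings are the single tokens "1" to "5", and that the soft estimate is the probability-weighted mean rating. The function names and the `logprobs` values are hypothetical.

```python
import math

# Assumed rating scale: single-token labels "1" (easiest) to "5" (hardest).
SCALE = ("1", "2", "3", "4", "5")


def hard_rating(token_logprobs, scale=SCALE):
    """Hard decision: the single most probable rating token."""
    candidates = {t: lp for t, lp in token_logprobs.items() if t in scale}
    return int(max(candidates, key=candidates.get))


def expected_rating(token_logprobs, scale=SCALE):
    """Soft estimate: probability-weighted mean over the rating tokens."""
    # Keep only tokens on the scale, then renormalize so they sum to 1.
    probs = {t: math.exp(lp) for t, lp in token_logprobs.items() if t in scale}
    total = sum(probs.values())
    return sum(int(t) * p / total for t, p in probs.items())


# Hypothetical log-probabilities for the first answer token of one item.
logprobs = {
    "2": math.log(0.1),
    "3": math.log(0.5),
    "4": math.log(0.3),
    "the": math.log(0.1),  # off-scale token, ignored by both functions
}

hard = hard_rating(logprobs)
soft = expected_rating(logprobs)
print(hard, round(soft, 3))
```

The soft estimate takes continuous values, so two items that both receive the same hard rating can still be ordered, which matters when the evaluation metric is a rank correlation.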

Related papers