Fugu-MT Paper Translation (Abstract): Assessment Design in the AI Era: A Method for Identifying Items Functioning Differentially for Humans and Chatbots

Paper abstract: Assessment Design in the AI Era: A Method for Identifying Items Functioning Differentially for Humans and Chatbots

arxiv url: http://arxiv.org/abs/2603.23682v1
Date: Tue, 24 Mar 2026 19:39:39 GMT
Status: Translation complete
Last updated in system: 2026-03-26 21:06:11.011988
Title: Assessment Design in the AI Era: A Method for Identifying Items Functioning Differentially for Humans and Chatbots
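Title (reference translation): Assessment Design in the AI Era: A Method for Identifying Items That Function Differently for Humans and Chatbots
Authors: Licol Zeinfeld, Alona Strugatski, Ziva Bar-Dov, Ron Blonder, Shelley Rap, Giora Alexandron
Abstract summary: The rapid adoption of large language models (LLMs) in education poses serious challenges for assessment design. We introduce a statistically principled approach for identifying items on which humans and LLMs show systematic response differences. The method is based on Differential Item Functioning (DIF) analysis.
Reference score (independently computed attention measure): 0.0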
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: The rapid adoption of large language models (LLMs) in education raises profound challenges for assessment design. To adapt assessments to the presence of LLM-based tools, it is crucial to characterize the strengths and weaknesses of LLMs in a generalizable, valid and reliable manner. However, current LLM evaluations often rely on descriptive statistics derived from benchmarks, and little research applies theory-grounded measurement methods to characterize LLM capabilities relative to human learners in ways that directly support assessment design. Here, by combining educational data mining and psychometric theory, we introduce a statistically principled approach for identifying items on which humans and LLMs show systematic response differences, pinpointing where assessments may be most vulnerable to AI misuse, and which task dimensions make problems particularly easy or difficult for generative AI. The method is based on Differential Item Functioning (DIF) analysis, traditionally used to detect bias across demographic groups, together with negative control analysis and item-total correlation discrimination analysis. It is evaluated on responses from human learners and six leading chatbots (ChatGPT-4o & 5.2, Gemini 1.5 & 3 Pro, Claude 3.5 & 4.5 Sonnet) to two instruments: a high school chemistry diagnostic test and a university entrance exam. Subject-matter experts then analyzed DIF-flagged items to characterize task dimensions associated with chatbot over- or under-performance. Results show that DIF-informed analytics provide a robust framework for understanding where LLM and human capabilities diverge, and highlight their value for improving the design of valid, reliable, and fair assessment in the AI era.
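Abstract (reference translation): The rapid adoption of large language models (LLMs) in education poses serious challenges for assessment design. To adapt to the presence of LLM tools, it is important to characterize the strengths and weaknesses of LLMs in a generalizable, valid, and reliable way. However, current LLM evaluations often rely on descriptive statistics derived from benchmarks, and theory-grounded measurement methods for characterizing LLM capabilities relative to human learners, in ways that directly support assessment design, have hardly been studied. Here, by combining educational data mining and psychometric theory, we introduce a statistically principled approach that identifies items on which humans and LLMs show systematic response differences, and that determines which assessments are most vulnerable to AI misuse and which task dimensions make problems particularly easy or difficult. The method is based on Differential Item Functioning (DIF) analysis (traditionally used to detect bias across demographic groups), together with negative control analysis and item-total correlation analysis. We evaluated the responses of human learners and leading chatbots (ChatGPT-4o & 5.2, Gemini 1.5 & 3 Pro, Claude 3.5 & 4.5 Sonnet) to two instruments: a high school chemistry diagnostic test and a university entrance exam. Subject-matter experts analyzed DIF-flagged items to characterize the task dimensions associated with chatbot over- or under-performance. The results show that DIF-informed analytics provide a robust framework for understanding where LLM and human capabilities diverge, and highlight their value for improving the design of valid, reliable, and fair assessment in the AI era.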
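
To make the idea of DIF concrete, here is a minimal sketch of one classical DIF statistic, the Mantel-Haenszel common odds ratio, computed for a single right/wrong item. The abstract does not say which DIF procedure the authors use, so the choice of Mantel-Haenszel, the function name `mantel_haenszel_dif`, and its parameters are all assumptions made for illustration. Humans are treated as the reference group and chatbots as the focal group, and respondents are matched on a score such as the test total.

```python
import math
from collections import defaultdict


def mantel_haenszel_dif(item_correct, is_focal, matching_score):
    """Return (alpha_MH, MH D-DIF) for one dichotomous item.

    Hypothetical helper, not taken from the paper.
    item_correct   : list of 0/1 responses to the studied item
    is_focal       : list of bools, True for the focal group (e.g. chatbots)
    matching_score : list of ints used to stratify respondents (e.g. total score)
    """
    # Per-stratum 2x2 counts:
    # [reference correct, reference wrong, focal correct, focal wrong]
    strata = defaultdict(lambda: [0, 0, 0, 0])
    for y, focal, score in zip(item_correct, is_focal, matching_score):
        strata[score][2 * int(focal) + (1 - y)] += 1

    num = den = 0.0
    for a, b, c, d in strata.values():
        n = a + b + c + d
        num += a * d / n  # reference correct x focal wrong
        den += b * c / n  # reference wrong x focal correct
    if num == 0 or den == 0:
        raise ValueError("common odds ratio is undefined for these counts")

    alpha = num / den
    # ETS delta scale: negative values mean the item favors the reference group.
    return alpha, -2.35 * math.log(alpha)


# Toy data (invented): five humans and four chatbots, all with the same total score.
responses = [1, 1, 1, 1, 0, 1, 1, 0, 0]
focal = [False] * 5 + [True] * 4
totals = [3] * 9
alpha, d_dif = mantel_haenszel_dif(responses, focal, totals)
```

A value of `alpha` above 1 (negative `d_dif`) would indicate that, among respondents with the same matching score, humans answer the item correctly more often than chatbots; a value below 1 indicates the reverse. With a focal group as small as a handful of chatbots, such an estimate is very noisy, so this should be read only as an illustration of the statistic, not of the paper's procedure.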
