Fugu-MT Paper Translation (Abstract): New Exam Security Questions in the AI Era: Comparing AI-Generated Item Similarity Between Naive and Detail-Guided Prompting Approaches

Paper abstract: New Exam Security Questions in the AI Era: Comparing AI-Generated Item Similarity Between Naive and Detail-Guided Prompting Approaches

arxiv url: http://arxiv.org/abs/2512.23729v1
Date: Fri, 19 Dec 2025 20:34:37 GMT
Status: Translation complete
Last updated in system: 2026-01-04 08:45:17.125838
Title: New Exam Security Questions in the AI Era: Comparing AI-Generated Item Similarity Between Naive and Detail-Guided Prompting Approaches
Title (reference translation): New Exam Security Issues in the AI Era: A Comparison of AI-Generated Items Under Naive and Detail-Guided Prompting Approaches
Authors: Ting Wang, Caroline Prendergast, Susan Lottridge
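Abstract summary: Large language models (LLMs) have emerged as powerful tools for generating domain-specific multiple-choice questions (MCQs). This study investigated whether LLM-generated items created with proprietary guidance differ meaningfully from those generated using only publicly available resources.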
Reference score (independently computed attention measure): 3.628322895108074
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) have emerged as powerful tools for generating domain-specific multiple-choice questions (MCQs), offering efficiency gains for certification boards but raising new concerns about examination security. This study investigated whether LLM-generated items created with proprietary guidance differ meaningfully from those generated using only publicly available resources. Four representative clinical activities from the American Board of Family Medicine (ABFM) blueprint were mapped to corresponding Entrustable Professional Activities (EPAs), and three LLMs (GPT-4o, Claude 4 Sonnet, Gemini 2.5 Flash) produced items under a naive strategy using only public EPA descriptors, while GPT-4o additionally produced items under a guided strategy that incorporated proprietary blueprints, item-writing guidelines, and exemplar items, yielding 160 total items. Question stems and options were encoded using PubMedBERT and BioBERT, and intra- and inter-strategy cosine similarity coefficients were calculated. Results showed high internal consistency within each prompting strategy, while cross-strategy similarity was lower overall. However, several domain-model pairs, particularly in narrowly defined areas such as viral pneumonia and hypertension, exceeded the 0.65 threshold, indicating convergence between naive and guided pipelines. These findings suggest that while proprietary resources impart distinctiveness, LLMs prompted only with public information can still generate items closely resembling guided outputs in constrained clinical domains, thereby heightening risks of item exposure. Safeguarding the integrity of high-stakes examinations will require human-first, AI-assisted item development, strict separation of formative and summative item pools, and systematic similarity surveillance to balance innovation with security.
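Abstract (reference translation): Large language models (LLMs) have emerged as powerful tools for generating domain-specific multiple-choice questions (MCQs), offering efficiency gains for certification boards while raising new concerns about examination security. This study examined whether LLM-generated items created with proprietary guidance differ meaningfully from those generated using only publicly available resources. Clinical activities from the American Board of Family Medicine (ABFM) blueprint were mapped to the corresponding Entrustable Professional Activities (EPAs), and three LLMs (GPT-4o, Claude 4 Sonnet, Gemini 2.5 Flash) produced items under a naive strategy using only public EPA descriptors. Question stems and options were encoded using PubMedBERT and BioBERT, and intra- and inter-strategy cosine similarity coefficients were calculated. The results showed high internal consistency within each prompting strategy, while cross-strategy similarity was lower overall. However, several domain-model pairs, particularly in narrowly defined areas such as viral pneumonia and hypertension, exceeded the 0.65 threshold, indicating convergence between the naive and guided pipelines. These findings suggest that while proprietary resources impart distinctiveness, LLMs prompted only with public information can still generate items closely resembling guided outputs in constrained clinical domains, heightening the risk of item exposure. Safeguarding the integrity of high-stakes examinations will require human-first, AI-assisted item development, strict separation of formative and summative item pools, and systematic similarity surveillance to balance innovation with security.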
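
The similarity check described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the vectors below are made-up stand-ins for PubMedBERT or BioBERT embeddings, the function names are hypothetical, and averaging over all naive-guided item pairs is an assumption, since the abstract does not say how pairwise similarities were aggregated. Only the cosine similarity measure and the 0.65 threshold come from the abstract.

```python
import math
from itertools import product

THRESHOLD = 0.65  # similarity threshold quoted in the abstract


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_cross_similarity(group_a, group_b):
    """Average cosine similarity over every (a, b) pair.

    Assumption: mean aggregation is chosen here for illustration only.
    """
    sims = [cosine_similarity(u, v) for u, v in product(group_a, group_b)]
    return sum(sims) / len(sims)


# Toy 3-dimensional vectors standing in for real sentence embeddings.
naive_items = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
guided_items = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

score = mean_cross_similarity(naive_items, guided_items)
flagged = score > THRESHOLD  # True would suggest naive/guided convergence
print(f"mean cross-strategy similarity = {score:.2f}, flagged = {flagged}")
```

With real data, each item's stem and options would first be encoded by an embedding model, and the same comparison would be run per clinical domain and per model pair.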

