Fugu-MT Paper Translation (Abstract): Prompting Strategies for Language Model-Based Item Generation in K-12 Education: Bridging the Gap Between Small and Large Language Models

Paper abstract: Prompting Strategies for Language Model-Based Item Generation in K-12 Education: Bridging the Gap Between Small and Large Language Models

arxiv url: http://arxiv.org/abs/2508.20217v1
Date: Wed, 27 Aug 2025 18:54:32 GMT
Status: Translation complete
System update date: 2025-08-29 18:12:01.730386
Title: Prompting Strategies for Language Model-Based Item Generation in K-12 Education: Bridging the Gap Between Small and Large Language Models
Authors: Mohammad Amini, Babak Ahmadi, Xiaomeng Xiong, Yilin Zhang, Christopher Qiao
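Abstract summary: This study uses automatic item generation (AIG) with language models to create multiple-choice questions (MCQs) for morphological assessment. It evaluated seven structured prompting strategies, including zero-shot, few-shot, chain-of-thought, role-based, sequential, and combinations. The results show that structured prompting, especially strategies combining chain-of-thought and sequential design, significantly improved Gemma's outputs.
Reference score (independently calculated attention score): 5.584522240405349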
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This study explores automatic item generation (AIG) using language models to create multiple-choice questions (MCQs) for morphological assessment, aiming to reduce the cost and inconsistency of manual test development. The study used a two-fold approach. First, we compared a fine-tuned mid-sized model (Gemma, 2B) with a larger untuned one (GPT-3.5, 175B). Second, we evaluated seven structured prompting strategies, including zero-shot, few-shot, chain-of-thought, role-based, sequential, and combinations. Generated items were assessed using automated metrics and expert scoring across five dimensions. We also used GPT-4.1, trained on expert-rated samples, to simulate human scoring at scale. Results show that structured prompting, especially strategies combining chain-of-thought and sequential design, significantly improved Gemma's outputs. Gemma generally produced more construct-aligned and instructionally appropriate items than GPT-3.5's zero-shot responses, with prompt design playing a key role in mid-sized model performance. This study demonstrates that structured prompting and efficient fine-tuning can enhance mid-sized models for AIG under limited data conditions. We highlight the value of combining automated metrics, expert judgment, and large-model simulation to ensure alignment with assessment goals. The proposed workflow offers a practical and scalable way to develop and validate language assessment items for K-12.
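The abstract names the prompting strategies but does not reproduce the prompts themselves. The following is a minimal sketch of how such strategies might differ in structure. Every prompt string, the `build_prompt` helper, the strategy keys, and the example item are hypothetical illustrations, not the authors' wording or code.

```python
# Hypothetical prompt builders. All wording below is illustrative only
# and is NOT taken from the paper.

FEW_SHOT_EXAMPLE = (
    "Q: Which word contains the prefix 'un-'?\n"
    "A) under  B) unhappy  C) uncle  D) unit\n"
    "Answer: B"
)

ROLE = "You are an experienced K-12 language assessment developer."
COT = "Reason step by step about the target skill before you answer."


def build_prompt(strategy: str, target: str) -> list[str]:
    """Return the prompt turns for one strategy (a single turn unless sequential)."""
    task = f"Write one multiple-choice question assessing: {target}."
    if strategy == "zero_shot":
        return [task]
    if strategy == "few_shot":
        # Prepend a worked example so the model can imitate the item format.
        return [f"{FEW_SHOT_EXAMPLE}\n\n{task}"]
    if strategy == "chain_of_thought":
        return [f"{task}\n{COT}"]
    if strategy == "role_based":
        return [f"{ROLE}\n{task}"]
    if strategy == "sequential":
        # Break item writing into separate turns: stem, key, distractors.
        return [
            f"Step 1: Write a question stem assessing: {target}.",
            "Step 2: Write the correct answer for that stem.",
            "Step 3: Write three plausible distractors.",
        ]
    if strategy == "cot_sequential":
        # Combination: add the reasoning instruction to every sequential turn.
        return [f"{turn}\n{COT}" for turn in build_prompt("sequential", target)]
    raise ValueError(f"unknown strategy: {strategy}")
```

Under these assumptions, `build_prompt("cot_sequential", "derivational suffixes")` yields three turns, each ending with the step-by-step reasoning instruction, which mirrors the kind of chain-of-thought plus sequential combination the abstract reports as most helpful for Gemma.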
