Fugu-MT Paper Translation (Abstract): Application-Driven Pedagogical Knowledge Optimization of Open-Source LLMs via Reinforcement Learning and Supervised Fine-Tuning

arxiv url: http://arxiv.org/abs/2604.06385v1
Date: Tue, 07 Apr 2026 19:16:50 GMT
Status: Translation complete
Last updated in system: 2026-04-09 17:30:51.198985
Title: Application-Driven Pedagogical Knowledge Optimization of Open-Source LLMs via Reinforcement Learning and Supervised Fine-Tuning
Authors: Navan Preet Singh, Xiaokun Wang, Anurag Garikipati, Madalina Ciobanu, Qingqing Mao, Ritankar Das
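Abstract summary: This paper proposes an innovative multi-stage optimization strategy that combines reinforcement learning (RL) and supervised fine-tuning (SFT) to enhance the pedagogical knowledge of large language models (LLMs). EduQwen 32B-RL1, EduQwen 32B-SFT, and EduQwen 32B-SFT-RL2 are an application-driven family of open-source pedagogical LLMs built on a dense Qwen3-32B backbone.
Reference score (independently computed attention score): 0.5329114964121364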
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present an innovative multi-stage optimization strategy combining reinforcement learning (RL) and supervised fine-tuning (SFT) to enhance the pedagogical knowledge of large language models (LLMs), as illustrated by EduQwen 32B-RL1, EduQwen 32B-SFT, and an optional third-stage model EduQwen 32B-SFT-RL2: (1) RL optimization that implements progressive difficulty training, focuses on challenging examples, and employs extended reasoning rollouts; (2) a subsequent SFT phase that leverages the RL-trained model to synthesize high-quality training data with difficulty-weighted sampling; and (3) an optional second round of RL optimization. EduQwen 32B-RL1, EduQwen 32B-SFT, and EduQwen 32B-SFT-RL2 are an application-driven family of open-source pedagogical LLMs built on a dense Qwen3-32B backbone. These models remarkably achieve high enough accuracy on the Cross-Domain Pedagogical Knowledge (CDPK) Benchmark to establish new state-of-the-art (SOTA) results across the interactive Pedagogy Benchmark Leaderboard and surpass significantly larger proprietary systems such as the previous benchmark leader Gemini-3 Pro. These dense 32-billion-parameter models demonstrate that domain-specialized optimization can transform mid-sized open-source LLMs into true pedagogical domain experts that outperform much larger general-purpose systems, while preserving the transparency, customizability, and cost-efficiency required for responsible educational AI deployment.
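The abstract mentions difficulty-weighted sampling for the SFT stage but does not say how the weights are computed. The following is only a minimal sketch under stated assumptions: difficulty is taken to be one minus a per-example pass rate, and a small floor keeps easy examples in the pool. The function names, the `floor` parameter, and the pass-rate signal are all hypothetical and not taken from the paper.

```python
import random


def difficulty_weights(pass_rates, floor=0.05):
    """Map per-example pass rates in [0, 1] to sampling weights.

    Assumption (not from the paper): difficulty = 1 - pass_rate.
    `floor` is a hypothetical lower bound so easy examples keep a
    small, nonzero chance of being drawn.
    """
    return [max(1.0 - p, floor) for p in pass_rates]


def sample_sft_examples(examples, pass_rates, k, seed=0):
    """Draw k examples with replacement, proportional to difficulty weight."""
    rng = random.Random(seed)  # seeded for reproducibility
    weights = difficulty_weights(pass_rates)
    return rng.choices(examples, weights=weights, k=k)


if __name__ == "__main__":
    pool = ["easy", "medium", "hard"]
    rates = [1.0, 0.5, 0.0]  # illustrative pass rates, not real data
    batch = sample_sft_examples(pool, rates, k=1000)
    print({name: batch.count(name) for name in pool})
```

With these illustrative pass rates, the hardest example is drawn far more often than the easiest, which is the qualitative behavior the abstract describes. The actual weighting scheme used for the EduQwen models may differ.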