Fugu-MT Paper Translation (Summary): Assessing the Pedagogical Readiness of Large Language Models as AI Tutors in Low-Resource Contexts: A Case Study of Nepal's K-10 Curriculum

Paper summary: Assessing the Pedagogical Readiness of Large Language Models as AI Tutors in Low-Resource Contexts: A Case Study of Nepal's K-10 Curriculum

arxiv url: http://arxiv.org/abs/2604.09619v1
Date: Tue, 17 Mar 2026 04:37:22 GMT
Status: Translation complete
System update date: 2026-04-19 19:09:11.564551
Title: Assessing the Pedagogical Readiness of Large Language Models as AI Tutors in Low-Resource Contexts: A Case Study of Nepal's K-10 Curriculum
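Title (reference translation): Assessing the Pedagogical Readiness of Large Language Models as AI Tutors in Low-Resource Environments: Nepal's K-10 Curriculum as a Case Study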
Authors: Pratyush Acharya, Prasansha Bharati, Yokibha Chapagain, Isha Sharma Gauli, Kiran Parajuli
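Abstract summary: The integration of large language models into educational ecosystems promises to democratize access to personalized tutoring. This study systematically evaluates four state-of-the-art LLMs (GPT-4o, Claude Sonnet 4, Qwen3-235B, and Kimi K2) to assess their capacity as AI tutors.
Reference score (independently calculated attention score): 0.0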
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The integration of Large Language Models (LLMs) into educational ecosystems promises to democratize access to personalized tutoring, yet the readiness of these systems for deployment in non-Western, low-resource contexts remains critically under-examined. This study presents a systematic evaluation of four state-of-the-art LLMs--GPT-4o, Claude Sonnet 4, Qwen3-235B, and Kimi K2--assessing their capacity to function as AI tutors within the specific curricular and cultural framework of Nepal's Grade 5-10 Science and Mathematics education. We introduce a novel, curriculum-aligned benchmark and a fine-grained evaluation framework inspired by the "natural language unit tests" paradigm, decomposing pedagogical efficacy into seven binary metrics: Prompt Alignment, Factual Correctness, Clarity, Contextual Relevance, Engagement, Harmful Content Avoidance, and Solution Accuracy. Our results reveal a stark "curriculum-alignment gap." While frontier models (GPT-4o, Claude Sonnet 4) achieve high aggregate reliability (approximately 97%), significant deficiencies persist in pedagogical clarity and cultural contextualization. We identify two pervasive failure modes: the "Expert's Curse," where models solve complex problems but fail to explain them clearly to novices, and the "Foundational Fallacy," where performance paradoxically degrades on simpler, lower-grade material due to an inability to adapt to younger learners' cognitive constraints. Furthermore, regional models like Kimi K2 exhibit a "Contextual Blindspot," failing to provide culturally relevant examples in over 20% of interactions. These findings suggest that off-the-shelf LLMs are not yet ready for autonomous deployment in Nepalese classrooms. We propose a "human-in-the-loop" deployment strategy and offer a methodological blueprint for curriculum-specific fine-tuning to align global AI capabilities with local educational needs.
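Abstract (reference translation): The integration of Large Language Models (LLMs) into educational ecosystems promises to democratize access to personalized tutoring, but the readiness of these systems for deployment in non-Western, low-resource contexts remains critically under-examined. This study systematically evaluates four state-of-the-art LLMs (GPT-4o, Claude Sonnet 4, Qwen3-235B, and Kimi K2) to assess their capacity to function as AI tutors within the specific curricular and cultural framework of Nepal's science and mathematics education. We introduce a new curriculum-aligned benchmark and a fine-grained evaluation framework inspired by the "natural language unit tests" paradigm, decomposing pedagogical efficacy into seven binary metrics. Our results reveal a "curriculum-alignment gap." While frontier models (GPT-4o, Claude Sonnet 4) achieve high aggregate reliability (approximately 97%), significant deficiencies persist in pedagogical clarity and cultural contextualization. We identify two failure modes: the "Expert's Curse," where models solve complex problems but cannot explain them clearly to novices, and the "Foundational Fallacy," where performance paradoxically degrades on simpler, lower-grade material because the models cannot adapt to younger learners' cognitive constraints. Furthermore, regional models like Kimi K2 exhibit a "Contextual Blindspot," failing to provide culturally relevant examples in over 20% of interactions. These results suggest that off-the-shelf LLMs are not yet ready for autonomous deployment in Nepalese classrooms. We propose a "human-in-the-loop" deployment strategy and offer a methodological blueprint for curriculum-specific fine-tuning to align global AI capabilities with local educational needs.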
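The abstract describes scoring each tutor response against seven binary metrics and reporting an aggregate reliability figure. The following is a minimal sketch of how such pass/fail judgments might be tallied; the aggregation rule (an unweighted mean over all checks), the function names, and the sample data are all assumptions made for illustration and are not taken from the paper.

```python
# The seven binary metrics named in the abstract (identifier spellings are ours).
METRICS = (
    "prompt_alignment",
    "factual_correctness",
    "clarity",
    "contextual_relevance",
    "engagement",
    "harmful_content_avoidance",
    "solution_accuracy",
)


def pass_rates(judgments):
    """Return the fraction of responses that passed each metric.

    `judgments` is a list of dicts mapping every metric name to True/False.
    """
    if not judgments:
        raise ValueError("at least one judged response is required")
    n = len(judgments)
    return {m: sum(bool(j[m]) for j in judgments) / n for m in METRICS}


def aggregate_reliability(judgments):
    """Assumed aggregate: unweighted mean of the per-metric pass rates."""
    rates = pass_rates(judgments)
    return sum(rates.values()) / len(METRICS)


# Hypothetical data: one response passes everything, one fails two metrics.
all_pass = {m: True for m in METRICS}
weak_explanation = dict(all_pass, clarity=False, contextual_relevance=False)
sample = [all_pass, weak_explanation]

rates = pass_rates(sample)
overall = aggregate_reliability(sample)
```

With this made-up sample, the overall figure stays well above the pass rate for clarity and contextual relevance, which shows in miniature how a high aggregate score can coexist with weak individual metrics.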


Related papers