From Solver to Tutor: Evaluating the Pedagogical Intelligence of LLMs with KMP-Bench
- URL: http://arxiv.org/abs/2603.02775v1
- Date: Tue, 03 Mar 2026 09:14:57 GMT
- Title: From Solver to Tutor: Evaluating the Pedagogical Intelligence of LLMs with KMP-Bench
- Authors: Weikang Shi, Houxing Ren, Junting Pan, Aojun Zhou, Ke Wang, Zimu Lu, Yunqiao Yang, Yuxuan Hu, Linda Wei, Mingjie Zhan, Hongsheng Li
- Abstract summary: We introduce KMP-Bench, a comprehensive K-8 Mathematical Pedagogical Benchmark designed to assess Large Language Models (LLMs). The first module, KMP-Dialogue, evaluates holistic pedagogical capabilities against six core principles. The second module, KMP-Skills, provides a granular assessment of foundational tutoring abilities, including multi-turn problem-solving, error detection and correction, and problem generation.
- Score: 56.66490747967379
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) show significant potential in AI mathematical tutoring, yet current evaluations often rely on simplistic metrics or narrow pedagogical scenarios, failing to assess comprehensive, multi-turn teaching effectiveness. In this paper, we introduce KMP-Bench, a comprehensive K-8 Mathematical Pedagogical Benchmark designed to assess LLMs from two complementary perspectives. The first module, KMP-Dialogue, evaluates holistic pedagogical capabilities against six core principles (e.g., Challenge, Explanation, Feedback), leveraging a novel multi-turn dialogue dataset constructed by weaving together diverse pedagogical components. The second module, KMP-Skills, provides a granular assessment of foundational tutoring abilities, including multi-turn problem-solving, error detection and correction, and problem generation. Our evaluations on KMP-Bench reveal a key disparity: while leading LLMs excel at tasks with verifiable solutions, they struggle with the nuanced application of pedagogical principles. Additionally, we present KMP-Pile, a large-scale (150K) dialogue dataset. Models fine-tuned on KMP-Pile show substantial improvement on KMP-Bench, underscoring the value of pedagogically rich training data for developing more effective AI math tutors.
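As a rough illustration of how a KMP-Dialogue-style evaluation could be driven, the following minimal Python sketch scores a multi-turn tutoring dialogue against pedagogical principles with an LLM judge. The `judge.complete` client, the prompt wording, and the 1-5 scale are assumptions for illustration, and the abstract names only three of the six principles, so the list below is incomplete.

```python
from dataclasses import dataclass

# Three of the six core principles named in the abstract; the others are not listed there.
PRINCIPLES = ["Challenge", "Explanation", "Feedback"]

@dataclass
class Turn:
    role: str   # "tutor" or "student"
    text: str

def render_dialogue(dialogue: list[Turn]) -> str:
    """Flatten a multi-turn dialogue into a plain-text transcript."""
    return "\n".join(f"{t.role}: {t.text}" for t in dialogue)

def score_principle(judge, dialogue: list[Turn], principle: str) -> int:
    """Ask a judge model to rate one pedagogical principle on a 1-5 scale."""
    prompt = (
        f"Rate the tutor in this dialogue on the principle '{principle}' "
        f"from 1 (poor) to 5 (excellent). Reply with a single integer.\n\n"
        f"{render_dialogue(dialogue)}"
    )
    reply = judge.complete(prompt)  # hypothetical judge-model client
    return int(reply.strip().split()[0])

def evaluate_dialogue(judge, dialogue: list[Turn]) -> dict[str, int]:
    """Principle-by-principle rubric scores for one dialogue."""
    return {p: score_principle(judge, dialogue, p) for p in PRINCIPLES}
```

In this reading, the per-principle scores would be averaged over the dialogue dataset, while the KMP-Skills tasks (multi-turn problem-solving, error detection and correction, problem generation) have more directly verifiable targets.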
Related papers
- Hierarchical Pedagogical Oversight: A Multi-Agent Adversarial Framework for Reliable AI Tutoring [0.0]
We introduce Hierarchical Pedagogical Oversight (HPO), a framework that adapts structured adversarial synthesis to educational assessment. Unlike cooperative multi-agent systems that often drift toward superficial consensus, HPO enforces a dialectical separation of concerns. We evaluate this framework on the MRBench dataset of 1,214 middle-school mathematics dialogues.
arXiv Detail & Related papers (2025-12-27T06:42:07Z)
- EduDial: Constructing a Large-scale Multi-turn Teacher-Student Dialogue Corpus [59.693733170193944]
We present EduDial, a comprehensive multi-turn teacher-student dialogue dataset. EduDial covers 345 core knowledge points and consists of 34,250 dialogue sessions generated through interactions between teacher and student agents.
arXiv Detail & Related papers (2025-10-14T18:18:43Z)
- Enabling Multi-Agent Systems as Learning Designers: Applying Learning Sciences to AI Instructional Design [6.080614844688028]
This study shifts pedagogical expertise from the user's prompt to the LLM's internal architecture. We tested three systems for generating secondary Math and Science learning activities.
arXiv Detail & Related papers (2025-08-20T14:44:00Z)
- Benchmarking the Pedagogical Knowledge of Large Language Models [4.417539128489408]
This paper introduces The Pedagogy Benchmark, a novel dataset designed to evaluate large language models on their pedagogical knowledge. These benchmarks are built on a carefully curated set of questions sourced from professional development exams for teachers. We report results for 97 models, with accuracies ranging from 28% to 89% on the pedagogical knowledge questions.
arXiv Detail & Related papers (2025-06-23T14:49:01Z)
- Pedagogy-R1: Pedagogically-Aligned Reasoning Model with Balanced Educational Benchmark [6.024228339466189]
Large reasoning models (LRMs) show strong performance in structured domains such as mathematics and programming, yet they often lack pedagogical coherence and realistic teaching behaviors. We introduce Pedagogy-R1, a framework that adapts LRMs for classroom use through three innovations.
arXiv Detail & Related papers (2025-05-24T02:18:35Z)
- From Problem-Solving to Teaching Problem-Solving: Aligning LLMs with Pedagogy using Reinforcement Learning [82.50157695987558]
Large language models (LLMs) can transform education, but their optimization for direct question-answering often undermines effective pedagogy. We propose an online reinforcement learning (RL)-based alignment framework that can quickly adapt LLMs into effective tutors.
arXiv Detail & Related papers (2025-05-21T15:00:07Z)
- EducationQ: Evaluating LLMs' Teaching Capabilities Through Multi-Agent Dialogue Framework [9.76455227840645]
Large language models (LLMs) increasingly serve as educational tools, yet evaluating their teaching capabilities remains challenging. We introduce EducationQ, a multi-agent dialogue framework that efficiently assesses teaching capabilities through simulated dynamic educational scenarios.
arXiv Detail & Related papers (2025-04-21T07:48:20Z)
- MathTutorBench: A Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors [82.91830877219822]
We present MathTutorBench, an open-source benchmark for holistic tutoring model evaluation. MathTutorBench contains datasets and metrics that broadly cover tutor abilities as defined by learning sciences research in dialog-based teaching. We evaluate a wide set of closed- and open-weight models and find that subject expertise, indicated by solving ability, does not immediately translate to good teaching.
arXiv Detail & Related papers (2025-02-26T08:43:47Z)
- ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection [60.297079601066784]
We introduce ErrorRadar, the first benchmark designed to assess MLLMs' capabilities in error detection.
ErrorRadar evaluates two sub-tasks: error step identification and error categorization.
It consists of 2,500 high-quality multimodal K-12 mathematical problems, collected from real-world student interactions.
Results indicate that significant challenges remain, as even the best-performing model, GPT-4o, still trails human evaluation by around 10%. (A minimal sketch of this two-part evaluation appears after this list.)
arXiv Detail & Related papers (2024-10-06T14:59:09Z)
- From Mimicking to Integrating: Knowledge Integration for Pre-Trained Language Models [55.137869702763375]
This paper explores a novel PLM reuse paradigm, Knowledge Integration (KI).
KI aims to merge the knowledge from different teacher PLMs, each of which specializes in a different classification problem, into a versatile student model.
We then design a Model Uncertainty-aware Knowledge Integration (MUKI) framework to recover the golden supervision for the student. (A hedged sketch of this integration step also appears after this list.)
arXiv Detail & Related papers (2022-10-11T07:59:08Z)
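To make the ErrorRadar entry above concrete, here is a minimal sketch of its two sub-tasks, error step identification and error categorization, expressed as an evaluation record. The field names, category labels, and scoring function are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass
class ErrorDetectionItem:
    problem: str              # problem statement (multimodal in the real benchmark)
    student_steps: list[str]  # the student's worked solution, step by step
    error_step: int           # gold index of the first erroneous step
    error_category: str       # gold label, e.g. "calculation" (hypothetical category names)

def score_prediction(item: ErrorDetectionItem, pred_step: int, pred_category: str) -> dict[str, bool]:
    """Exact-match accuracy on both sub-tasks for a single item."""
    return {
        "step_identified": pred_step == item.error_step,
        "category_correct": pred_category == item.error_category,
    }
```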
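For the MUKI entry, here is a hedged sketch of what uncertainty-aware integration of teacher predictions could look like, assuming each teacher's label set occupies a disjoint slice of the student's label space and that predictive entropy serves as the uncertainty signal. This is one plausible reading of the abstract, not the paper's exact algorithm.

```python
import torch
import torch.nn.functional as F

def entropy(probs: torch.Tensor) -> torch.Tensor:
    """Predictive entropy per instance; higher means the teacher is less sure."""
    return -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)

def integrate_supervision(teacher_logits: list[torch.Tensor]) -> torch.Tensor:
    """Recover per-instance supervision by trusting the least-uncertain teacher.

    teacher_logits: one (batch, num_classes_i) tensor per teacher; each
    teacher's classes occupy a disjoint slice of the student's label space.
    Returns a (batch, sum(num_classes_i)) target distribution for the student.
    """
    probs = [F.softmax(t, dim=-1) for t in teacher_logits]
    uncertainties = torch.stack([entropy(p) for p in probs], dim=-1)  # (batch, T)
    chosen = uncertainties.argmin(dim=-1)                             # (batch,)

    # Place the chosen teacher's distribution in its slice of the union
    # label space; leave the other slices at zero (a simplifying assumption).
    batch = probs[0].shape[0]
    target = torch.zeros(batch, sum(p.shape[-1] for p in probs))
    offset = 0
    for i, p in enumerate(probs):
        mask = (chosen == i).unsqueeze(-1)  # (batch, 1)
        target[:, offset:offset + p.shape[-1]] = torch.where(mask, p, torch.zeros_like(p))
        offset += p.shape[-1]
    return target
```

The student would then be trained with a cross-entropy or KL loss against this target, the usual pattern for distillation-style objectives.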