Benchmarking the Pedagogical Knowledge of Large Language Models
- URL: http://arxiv.org/abs/2506.18710v3
- Date: Tue, 01 Jul 2025 15:49:58 GMT
- Title: Benchmarking the Pedagogical Knowledge of Large Language Models
- Authors: Maxime Lelièvre, Amy Waldock, Meng Liu, Natalia Valdés Aspillaga, Alasdair Mackintosh, María José Ogando Portela, Jared Lee, Paul Atherton, Robin A. A. Ince, Oliver G. B. Garrod,
- Abstract summary: This paper introduces The Pedagogy Benchmark, a novel dataset designed to evaluate large language models on their pedagogical knowledge. These benchmarks are built on a carefully curated set of questions sourced from professional development exams for teachers. We report results for 97 models, with accuracies spanning a range from 28% to 89% on the pedagogical knowledge questions.
- Score: 4.417539128489408
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Benchmarks like Massive Multitask Language Understanding (MMLU) have played a pivotal role in evaluating AI's knowledge and abilities across diverse domains. However, existing benchmarks predominantly focus on content knowledge, leaving a critical gap in assessing models' understanding of pedagogy - the method and practice of teaching. This paper introduces The Pedagogy Benchmark, a novel dataset designed to evaluate large language models on their Cross-Domain Pedagogical Knowledge (CDPK) and Special Education Needs and Disability (SEND) pedagogical knowledge. These benchmarks are built on a carefully curated set of questions sourced from professional development exams for teachers, which cover a range of pedagogical subdomains such as teaching strategies and assessment methods. Here we outline the methodology and development of these benchmarks. We report results for 97 models, with accuracies spanning a range from 28% to 89% on the pedagogical knowledge questions. We consider the relationship between cost and accuracy and chart the progression of the Pareto value frontier over time. We provide online leaderboards at https://rebrand.ly/pedagogy which are updated with new models and allow interactive exploration and filtering based on various model properties, such as cost per token and open-vs-closed weights, as well as looking at performance in different subjects. LLMs and generative AI have tremendous potential to influence education and help to address the global learning crisis. Education-focused benchmarks are crucial to measure models' capacities to understand pedagogical concepts, respond appropriately to learners' needs, and support effective teaching practices across diverse contexts. They are needed for informing the responsible and evidence-based deployment of LLMs and LLM-based tools in educational settings, and for guiding both development and policy decisions.
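The cost-accuracy analysis mentioned in the abstract (charting the Pareto value frontier) can be illustrated with a short sketch. The code below is not from the paper: the model names, costs, and accuracies are placeholder values, and the frontier is computed in the usual way, keeping each model for which no cheaper model reaches equal or better accuracy.

```python
# Minimal sketch (not from the paper): find models on the cost-accuracy
# Pareto frontier, i.e. models for which no other model is both no more
# expensive and at least as accurate.

def pareto_frontier(models):
    """models: list of (name, cost_per_mtok_usd, accuracy) tuples."""
    # Sort by cost ascending; break ties by accuracy descending.
    ordered = sorted(models, key=lambda m: (m[1], -m[2]))
    frontier, best_acc = [], float("-inf")
    for name, cost, acc in ordered:
        if acc > best_acc:  # strictly better accuracy than any cheaper model
            frontier.append((name, cost, acc))
            best_acc = acc
    return frontier

# Placeholder values, not results from the benchmark.
models = [
    ("model-a", 0.50, 0.61),
    ("model-b", 3.00, 0.85),
    ("model-c", 1.20, 0.58),  # dominated by model-a
    ("model-d", 8.00, 0.89),
]
for name, cost, acc in pareto_frontier(models):
    print(f"{name}: ${cost}/Mtok, {acc:.0%} accuracy")
```

Tracking how this frontier shifts across model release dates gives the "progression over time" view described in the abstract.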
Related papers
- Decoding Instructional Dialogue: Human-AI Collaborative Analysis of Teacher Use of AI Tool at Scale [9.092920230987684]
The integration of large language models into educational tools has the potential to substantially impact how teachers plan instruction. This paper presents a human-AI collaborative methodology for large-scale qualitative analysis of over 140,000 educator-AI messages.
arXiv Detail & Related papers (2025-07-23T23:23:38Z)
- From Problem-Solving to Teaching Problem-Solving: Aligning LLMs with Pedagogy using Reinforcement Learning [76.09281171131941]
Large language models (LLMs) can transform education, but their optimization for direct question-answering often undermines effective pedagogy. We propose an online reinforcement learning (RL)-based alignment framework that can quickly adapt LLMs into effective tutors.
arXiv Detail & Related papers (2025-05-21T15:00:07Z)
- EducationQ: Evaluating LLMs' Teaching Capabilities Through Multi-Agent Dialogue Framework [9.76455227840645]
Large language models (LLMs) increasingly serve as educational tools, yet evaluating their teaching capabilities remains challenging. We introduce EducationQ, a multi-agent dialogue framework that efficiently assesses teaching capabilities through simulated dynamic educational scenarios.
arXiv Detail & Related papers (2025-04-21T07:48:20Z)
- LLMs as Educational Analysts: Transforming Multimodal Data Traces into Actionable Reading Assessment Reports [6.523137821124204]
This study investigates the use of multimodal data sources to derive meaningful reading insights. We employ unsupervised learning techniques to identify distinct reading behavior patterns. A large language model (LLM) synthesizes the derived information into actionable reports for educators.
arXiv Detail & Related papers (2025-03-03T22:34:08Z)
- MathTutorBench: A Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors [76.1634959528817]
We present MathTutorBench, an open-source benchmark for holistic tutoring model evaluation. MathTutorBench contains datasets and metrics that broadly cover tutor abilities as defined by learning sciences research in dialog-based teaching. We evaluate a wide set of closed- and open-weight models and find that subject expertise, indicated by solving ability, does not immediately translate to good teaching.
arXiv Detail & Related papers (2025-02-26T08:43:47Z)
- Dr.Academy: A Benchmark for Evaluating Questioning Capability in Education for Large Language Models [30.759154473275043]
This study introduces a benchmark to evaluate the questioning capability of large language models (LLMs) when acting as teachers in education.
We apply four metrics, including relevance, coverage, representativeness, and consistency, to evaluate the educational quality of LLMs' outputs.
Our results indicate that GPT-4 demonstrates significant potential in teaching general, humanities, and science courses; Claude2 appears more apt as an interdisciplinary teacher.
arXiv Detail & Related papers (2024-08-20T15:36:30Z)
- Scaffolding Language Learning via Multi-modal Tutoring Systems with Pedagogical Instructions [34.760230622675365]
Intelligent tutoring systems (ITSs) imitate human tutors and aim to provide customized instructions or feedback to learners.
With the emergence of generative artificial intelligence, large language models (LLMs) enable such systems to hold complex and coherent conversational interactions.
We investigate how pedagogical instructions facilitate the scaffolding in ITSs, by conducting a case study on guiding children to describe images for language learning.
arXiv Detail & Related papers (2024-04-04T13:22:28Z)
- Evaluating and Optimizing Educational Content with Large Language Model Judgments [52.33701672559594]
We use Language Models (LMs) as educational experts to assess the impact of various instructions on learning outcomes.
We introduce an instruction optimization approach in which one LM generates instructional materials using the judgments of another LM as a reward function (see the illustrative sketch after this list).
Human teachers' evaluations of these LM-generated worksheets show a significant alignment between the LM judgments and human teacher preferences.
arXiv Detail & Related papers (2024-03-05T09:09:15Z)
- UKP-SQuARE: An Interactive Tool for Teaching Question Answering [61.93372227117229]
The exponential growth of question answering (QA) has made it an indispensable topic in any Natural Language Processing (NLP) course.
We introduce UKP-SQuARE as a platform for QA education.
Students can run, compare, and analyze various QA models from different perspectives.
arXiv Detail & Related papers (2023-05-31T11:29:04Z)
- Opportunities and Challenges in Neural Dialog Tutoring [54.07241332881601]
We rigorously analyze various generative language models on two dialog tutoring datasets for language learning.
We find that although current approaches can model tutoring in constrained learning scenarios, they perform poorly in less constrained scenarios.
Our human quality evaluation shows that both models and ground-truth annotations exhibit low performance in terms of equitable tutoring.
arXiv Detail & Related papers (2023-01-24T11:00:17Z)
- Neural Multi-Task Learning for Teacher Question Detection in Online Classrooms [50.19997675066203]
We build an end-to-end neural framework that automatically detects questions from teachers' audio recordings.
By incorporating multi-task learning techniques, we are able to strengthen the understanding of semantic relations among different types of questions (see the illustrative sketch after this list).
arXiv Detail & Related papers (2020-05-16T02:17:04Z)
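The multi-task question-detection entry above can be illustrated with a minimal, hypothetical PyTorch sketch; the paper's actual architecture is not reproduced here. The sketch assumes pre-extracted utterance feature vectors and shows the general idea only: a shared encoder feeding one head that detects whether an utterance is a question and a second head that classifies the question type, trained with a combined loss.

```python
import torch
import torch.nn as nn

class MultiTaskQuestionDetector(nn.Module):
    """Hypothetical sketch: shared encoder with two task-specific heads."""
    def __init__(self, input_dim=128, hidden_dim=256, num_question_types=4):
        super().__init__()
        # Shared utterance encoder (the paper works from audio recordings;
        # here we simply assume pre-extracted feature vectors).
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        self.is_question_head = nn.Linear(hidden_dim, 2)                 # question vs. not
        self.question_type_head = nn.Linear(hidden_dim, num_question_types)

    def forward(self, features):
        h = self.encoder(features)
        return self.is_question_head(h), self.question_type_head(h)

# Joint loss over both tasks (random placeholder data, not from the paper).
model = MultiTaskQuestionDetector()
features = torch.randn(8, 128)
is_q = torch.randint(0, 2, (8,))
q_type = torch.randint(0, 4, (8,))
logits_q, logits_type = model(features)
loss = nn.functional.cross_entropy(logits_q, is_q) + \
       nn.functional.cross_entropy(logits_type, q_type)
loss.backward()
```

Sharing the encoder between the two heads is what lets the type-classification task strengthen the representation used for question detection.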
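Similarly, the instruction-optimization idea from "Evaluating and Optimizing Educational Content with Large Language Model Judgments" above can be sketched as a simple generate-and-score loop. The generator_lm and judge_lm functions below are toy stand-ins, not a real API and not the paper's method; in practice they would call whichever generation and judge models are being used.

```python
import random

def generator_lm(prompt: str) -> str:
    # Toy stand-in for a generation model; a real call would return a draft worksheet.
    return f"Worksheet v{random.randint(1, 1000)} for: {prompt}"

def judge_lm(material: str, learning_goal: str) -> float:
    # Toy stand-in for a judge model; a real judge would score expected learning benefit.
    return random.random()

def optimize_instruction(learning_goal: str, num_candidates: int = 8) -> str:
    prompt = f"Write a short worksheet that teaches: {learning_goal}"
    candidates = [generator_lm(prompt) for _ in range(num_candidates)]
    # The judge's score acts as the reward signal; keep the highest-scoring candidate.
    return max(candidates, key=lambda c: judge_lm(c, learning_goal))

print(optimize_instruction("adding fractions with unlike denominators"))
```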
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this list (including all information) and is not responsible for any consequences arising from its use.