EducationQ: Evaluating LLMs' Teaching Capabilities Through Multi-Agent Dialogue Framework
- URL: http://arxiv.org/abs/2504.14928v1
- Date: Mon, 21 Apr 2025 07:48:20 GMT
- Title: EducationQ: Evaluating LLMs' Teaching Capabilities Through Multi-Agent Dialogue Framework
- Authors: Yao Shi, Rongkeng Liang, Yong Xu
- Abstract summary: Large language models (LLMs) increasingly serve as educational tools, yet evaluating their teaching capabilities remains challenging. We introduce EducationQ, a multi-agent dialogue framework that efficiently assesses teaching capabilities through simulated dynamic educational scenarios.
- Score: 9.76455227840645
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large language models (LLMs) increasingly serve as educational tools, yet evaluating their teaching capabilities remains challenging due to the resource-intensive, context-dependent, and methodologically complex nature of teacher-student interactions. We introduce EducationQ, a multi-agent dialogue framework that efficiently assesses teaching capabilities through simulated dynamic educational scenarios, featuring specialized agents for teaching, learning, and evaluation. Testing 14 LLMs from major AI organizations (OpenAI, Meta, Google, Anthropic, and others) on 1,498 questions spanning 13 disciplines and 10 difficulty levels reveals that teaching effectiveness does not correlate linearly with model scale or general reasoning capability; some smaller open-source models outperform larger commercial counterparts in teaching contexts. This finding highlights a critical gap in current evaluations, which prioritize knowledge recall over interactive pedagogy. Our mixed-methods evaluation, combining quantitative metrics with qualitative analysis and expert case studies, identifies distinct pedagogical strengths of top-performing models (e.g., sophisticated questioning strategies and adaptive feedback mechanisms). Human expert evaluations show 78% agreement with our automated qualitative analysis of effective teaching behaviors, validating our methodology. EducationQ demonstrates that LLMs-as-teachers require specialized optimization beyond simple scaling, suggesting that next-generation educational AI should prioritize targeted enhancement of specific dimensions of pedagogical effectiveness.
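The abstract describes a three-agent design (teaching, learning, and evaluation agents) that drives simulated teaching episodes. The sketch below is a minimal, hypothetical illustration of such a loop; the Agent class, the query_llm stub, the turn budget, and the prompts are assumptions made for illustration and do not reflect the authors' actual implementation, prompts, or metrics.

```python
# Hypothetical sketch of a teacher/student/evaluator dialogue loop in the
# spirit of EducationQ. All names and the flow are illustrative assumptions.
from dataclasses import dataclass

def query_llm(model: str, system_prompt: str, history: list[str]) -> str:
    """Stand-in for a real LLM API call; returns a canned reply here."""
    return f"[{model} reply after {len(history)} prior turns]"

@dataclass
class Agent:
    role: str            # "teacher", "student", or "evaluator"
    model: str           # identifier of the backing LLM
    system_prompt: str   # role instructions given to the model

    def respond(self, history: list[str]) -> str:
        return query_llm(self.model, self.system_prompt, history)

def simulate_lesson(question: str, teacher: Agent, student: Agent,
                    evaluator: Agent, max_turns: int = 5) -> dict:
    """Run one simulated teaching episode, then have the evaluator score it."""
    history = [f"Question under study: {question}"]
    for _ in range(max_turns):
        history.append(f"Teacher: {teacher.respond(history)}")
        history.append(f"Student: {student.respond(history)}")
    evaluation = evaluator.respond(history)  # e.g. rates questioning, feedback
    return {"transcript": history, "evaluation": evaluation}

if __name__ == "__main__":
    teacher = Agent("teacher", "model-a", "Teach via questions and feedback.")
    student = Agent("student", "model-b", "Answer as a learner, with gaps.")
    evaluator = Agent("evaluator", "model-c", "Rate the teaching quality.")
    print(simulate_lesson("Why do seasons occur?", teacher, student, evaluator))
```

In a real system the query_llm stub would be replaced by calls to the respective model APIs, and the evaluator's output would feed the framework's quantitative and qualitative analyses.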
Related papers
- Connecting Feedback to Choice: Understanding Educator Preferences in GenAI vs. Human-Created Lesson Plans in K-12 Education -- A Comparative Analysis [11.204345070162592]
Generative AI (GenAI) models are increasingly explored for educational applications. This study compares lesson plans authored by human curriculum designers, a fine-tuned LLaMA-2-13b model trained on K-12 content, and a customized GPT-4 model. Using a large-scale preference study with K-12 math educators, we examine how preferences vary across grade levels and instructional components.
arXiv Detail & Related papers (2025-04-07T19:28:19Z) - LLM Agents for Education: Advances and Applications [49.3663528354802]
Large Language Model (LLM) agents have demonstrated remarkable capabilities in automating tasks and driving innovation across diverse educational applications. This survey aims to provide a comprehensive technological overview of LLM agents for education, fostering further research and collaboration to enhance their impact for the greater good of learners and educators alike.
arXiv Detail & Related papers (2025-03-14T11:53:44Z) - MathTutorBench: A Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors [76.1634959528817]
We present MathTutorBench, an open-source benchmark for holistic tutoring model evaluation. MathTutorBench contains datasets and metrics that broadly cover tutor abilities as defined by learning sciences research in dialog-based teaching. We evaluate a wide set of closed- and open-weight models and find that subject expertise, indicated by solving ability, does not immediately translate to good teaching.
arXiv Detail & Related papers (2025-02-26T08:43:47Z) - Unifying AI Tutor Evaluation: An Evaluation Taxonomy for Pedagogical Ability Assessment of LLM-Powered AI Tutors [7.834688858839734]
We investigate whether current state-of-the-art large language models (LLMs) are effective as AI tutors. We propose a unified evaluation taxonomy with eight pedagogical dimensions based on key learning sciences principles. We release MRBench, a new evaluation benchmark containing 192 conversations and 1,596 responses from seven state-of-the-art LLM-based and human tutors.
arXiv Detail & Related papers (2024-12-12T16:24:35Z) - Dr.Academy: A Benchmark for Evaluating Questioning Capability in Education for Large Language Models [30.759154473275043]
This study introduces a benchmark to evaluate the questioning capability of large language models (LLMs) as teachers in education.
We apply four metrics, including relevance, coverage, representativeness, and consistency, to evaluate the educational quality of LLMs' outputs.
Our results indicate that GPT-4 demonstrates significant potential in teaching general, humanities, and science courses; Claude2 appears more apt as an interdisciplinary teacher.
arXiv Detail & Related papers (2024-08-20T15:36:30Z) - PersLLM: A Personified Training Approach for Large Language Models [66.16513246245401]
We propose PersLLM, integrating psychology-grounded principles of personality: social practice, consistency, and dynamic development.
We incorporate personality traits directly into the model parameters, enhancing the model's resistance to induction, promoting consistency, and supporting the dynamic evolution of personality.
arXiv Detail & Related papers (2024-07-17T08:13:22Z) - The Promises and Pitfalls of Using Language Models to Measure Instruction Quality in Education [3.967610895056427]
This paper presents the first study that leverages Natural Language Processing (NLP) techniques to assess multiple high-inference instructional practices.
We confront two challenges inherent in NLP-based instructional analysis: noisy, long input data and highly skewed distributions of human ratings.
arXiv Detail & Related papers (2024-04-03T04:15:29Z) - Evaluating and Optimizing Educational Content with Large Language Model Judgments [52.33701672559594]
We use Language Models (LMs) as educational experts to assess the impact of various instructions on learning outcomes.
We introduce an instruction optimization approach in which one LM generates instructional materials using the judgments of another LM as a reward function.
Human teachers' evaluations of these LM-generated worksheets show a significant alignment between the LM judgments and human teacher preferences.
arXiv Detail & Related papers (2024-03-05T09:09:15Z) - Opportunities and Challenges in Neural Dialog Tutoring [54.07241332881601]
We rigorously analyze various generative language models on two dialog tutoring datasets for language learning.
We find that although current approaches can model tutoring in constrained learning scenarios, they perform poorly in less constrained scenarios.
Our human quality evaluation shows that both models and ground-truth annotations exhibit low performance in terms of equitable tutoring.
arXiv Detail & Related papers (2023-01-24T11:00:17Z) - Modelling Assessment Rubrics through Bayesian Networks: a Pragmatic Approach [40.06500618820166]
This paper presents an approach to deriving a learner model directly from an assessment rubric.
We illustrate how the approach can be applied to automatize the human assessment of an activity developed for testing computational thinking skills.
arXiv Detail & Related papers (2022-09-07T10:09:12Z) - Improving mathematical questioning in teacher training [1.794107419334178]
High-fidelity, AI-based simulated classroom systems enable teachers to rehearse effective teaching strategies.
This paper builds a text-based interactive conversational agent to help teachers practice mathematical questioning skills.
arXiv Detail & Related papers (2021-12-02T05:33:03Z) - Neural Multi-Task Learning for Teacher Question Detection in Online Classrooms [50.19997675066203]
We build an end-to-end neural framework that automatically detects questions from teachers' audio recordings.
By incorporating multi-task learning techniques, we are able to strengthen the understanding of semantic relations among different types of questions.
arXiv Detail & Related papers (2020-05-16T02:17:04Z)