Unifying AI Tutor Evaluation: An Evaluation Taxonomy for Pedagogical Ability Assessment of LLM-Powered AI Tutors
- URL: http://arxiv.org/abs/2412.09416v1
- Date: Thu, 12 Dec 2024 16:24:35 GMT
- Title: Unifying AI Tutor Evaluation: An Evaluation Taxonomy for Pedagogical Ability Assessment of LLM-Powered AI Tutors
- Authors: Kaushal Kumar Maurya, KV Aditya Srivatsa, Kseniia Petukhova, Ekaterina Kochmar
- Abstract summary: We investigate whether current state-of-the-art large language models (LLMs) are effective as AI tutors. We propose a unified evaluation taxonomy with eight pedagogical dimensions based on key learning sciences principles. We release MRBench -- a new evaluation benchmark containing 192 conversations and 1,596 responses from seven state-of-the-art LLM-based and human tutors.
- Score: 7.834688858839734
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In this paper, we investigate whether current state-of-the-art large language models (LLMs) are effective as AI tutors and whether they demonstrate pedagogical abilities necessary for good AI tutoring in educational dialogues. Previous efforts towards evaluation have been limited to subjective protocols and benchmarks. To bridge this gap, we propose a unified evaluation taxonomy with eight pedagogical dimensions based on key learning sciences principles, which is designed to assess the pedagogical value of LLM-powered AI tutor responses grounded in student mistakes or confusion in the mathematical domain. We release MRBench -- a new evaluation benchmark containing 192 conversations and 1,596 responses from seven state-of-the-art LLM-based and human tutors, providing gold annotations for eight pedagogical dimensions. We assess the reliability of the popular Prometheus2 LLM as an evaluator and analyze each tutor's pedagogical abilities, highlighting which LLMs are good tutors and which ones are more suitable as question-answering systems. We believe that the presented taxonomy, benchmark, and human-annotated labels will streamline the evaluation process and help track the progress in AI tutors' development.
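For illustration only, the sketch below shows one way a rubric-style LLM judge (such as Prometheus2) could be prompted to score a tutor response along per-dimension rubrics and then compared against human gold annotations. This is not the authors' released code or MRBench's actual schema; the dimension labels, the 1-3 scale, the prompt wording, and the `judge` callable are assumptions.

```python
# A minimal, illustrative sketch -- NOT the authors' released code or MRBench's
# actual schema. Dimension names, the 1-3 scale, prompt wording, and the
# `judge` callable are assumptions for illustration.
from typing import Callable, Dict, List

from scipy.stats import spearmanr  # rank correlation against human gold labels

# Assumed dimension labels, loosely following the paper's eight-dimension taxonomy.
DIMENSIONS = [
    "mistake_identification", "mistake_location", "revealing_of_the_answer",
    "providing_guidance", "actionability", "coherence",
    "tutor_tone", "humanlikeness",
]


def build_rubric_prompt(dialogue: str, tutor_response: str, dimension: str) -> str:
    """Compose a rubric-style prompt asking the judge to rate one dimension."""
    return (
        f"Conversation so far:\n{dialogue}\n\n"
        f"Tutor response:\n{tutor_response}\n\n"
        f"Rate the response on '{dimension}': 1 = No, 2 = To some extent, 3 = Yes. "
        "Answer with a single number."
    )


def score_response(dialogue: str, tutor_response: str,
                   judge: Callable[[str], str]) -> Dict[str, int]:
    """Query a text-in/text-out judge LLM once per dimension and parse the score."""
    scores: Dict[str, int] = {}
    for dim in DIMENSIONS:
        raw = judge(build_rubric_prompt(dialogue, tutor_response, dim))
        digits = [ch for ch in raw if ch.isdigit()]
        scores[dim] = int(digits[0]) if digits else 0  # 0 marks an unparsable reply
    return scores


def evaluator_reliability(judge_labels: List[int], gold_labels: List[int]) -> float:
    """Spearman correlation between LLM-judge labels and human gold labels."""
    rho, _ = spearmanr(judge_labels, gold_labels)
    return float(rho)
```

Here `judge` stands in for whichever evaluator model is used (e.g., a Prometheus2 endpoint); the paper reports reliability against the human annotations released with MRBench, and the last function only illustrates the general shape of that comparison.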
Related papers
- EducationQ: Evaluating LLMs' Teaching Capabilities Through Multi-Agent Dialogue Framework [9.76455227840645]
Large language models (LLMs) increasingly serve as educational tools, yet evaluating their teaching capabilities remains challenging.
We introduce EducationQ, a multi-agent dialogue framework that efficiently assesses teaching capabilities through simulated dynamic educational scenarios.
arXiv Detail & Related papers (2025-04-21T07:48:20Z)
- Training LLM-based Tutors to Improve Student Learning Outcomes in Dialogues [46.60683274479208]
We introduce an approach to train large language models (LLMs) to generate tutor utterances that maximize the likelihood of student correctness.
We show that tutor utterances generated by our model lead to significantly higher chances of correct student responses.
arXiv Detail & Related papers (2025-03-09T03:38:55Z)
- MathTutorBench: A Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors [76.1634959528817]
We present MathTutorBench, an open-source benchmark for holistic tutoring model evaluation.
MathTutorBench contains datasets and metrics that broadly cover tutor abilities as defined by learning sciences research in dialog-based teaching.
We evaluate a wide set of closed- and open-weight models and find that subject expertise, indicated by solving ability, does not immediately translate to good teaching.
arXiv Detail & Related papers (2025-02-26T08:43:47Z)
- Do Tutors Learn from Equity Training and Can Generative AI Assess It? [2.116573423199236]
We evaluate tutor performance within an online lesson on enhancing tutors' skills when responding to students in potentially inequitable situations.
We find marginally significant learning gains with increases in tutors' self-reported confidence in their knowledge.
This work makes available a dataset of lesson log data, tutor responses, rubrics for human annotation, and generative AI prompts.
arXiv Detail & Related papers (2024-12-15T17:36:40Z)
- An Exploration of Higher Education Course Evaluation by Large Language Models [4.943165921136573]
Large language models (LLMs) within artificial intelligence (AI) present promising new avenues for enhancing course evaluation processes.
This study explores the application of LLMs in automated course evaluation from multiple perspectives and conducts rigorous experiments across 100 courses at a major university in China.
arXiv Detail & Related papers (2024-11-03T20:43:52Z)
- Optimizing the role of human evaluation in LLM-based spoken document summarization systems [0.0]
We propose an evaluation paradigm for spoken document summarization explicitly tailored for generative AI content.
We provide detailed evaluation criteria and best practices guidelines to ensure robustness in the experimental design, replicability, and trustworthiness of human evaluations.
arXiv Detail & Related papers (2024-10-23T18:37:14Z)
- Dr.Academy: A Benchmark for Evaluating Questioning Capability in Education for Large Language Models [30.759154473275043]
This study introduces a benchmark to evaluate the questioning capability of large language models (LLMs) acting as teachers in education.
We apply four metrics, including relevance, coverage, representativeness, and consistency, to evaluate the educational quality of LLMs' outputs.
Our results indicate that GPT-4 demonstrates significant potential in teaching general, humanities, and science courses; Claude2 appears more apt as an interdisciplinary teacher.
arXiv Detail & Related papers (2024-08-20T15:36:30Z)
- Large Language Model as an Assignment Evaluator: Insights, Feedback, and Challenges in a 1000+ Student Course [49.296957552006226]
Using large language models (LLMs) for automatic evaluation has become an important evaluation method in NLP research.
This report shares how we use GPT-4 as an automatic assignment evaluator in a university course with 1,028 students.
arXiv Detail & Related papers (2024-07-07T00:17:24Z)
- The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z)
- Large Language Models as Evaluators for Recommendation Explanations [23.938202791437337]
We investigate whether LLMs can serve as evaluators of recommendation explanations.
We design and apply a 3-level meta evaluation strategy to measure the correlation between evaluator labels and the ground truth provided by users.
Our study verifies that utilizing LLMs as evaluators can be an accurate, reproducible, and cost-effective solution for evaluating recommendation explanation texts.
arXiv Detail & Related papers (2024-06-05T13:23:23Z)
- DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators. How reliable these evaluators are has emerged as a crucial research question. We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
- Can Large Language Models Be an Alternative to Human Evaluations? [80.81532239566992]
Large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided.
We show that the result of LLM evaluation is consistent with the results obtained by expert human evaluation.
arXiv Detail & Related papers (2023-05-03T07:28:50Z)
- Modelling Assessment Rubrics through Bayesian Networks: a Pragmatic Approach [40.06500618820166]
This paper presents an approach to deriving a learner model directly from an assessment rubric.
We illustrate how the approach can be applied to automate the human assessment of an activity developed for testing computational thinking skills.
arXiv Detail & Related papers (2022-09-07T10:09:12Z)