Leveraging LLMs to Assess Tutor Moves in Real-Life Dialogues: A Feasibility Study
- URL: http://arxiv.org/abs/2506.17410v1
- Date: Fri, 20 Jun 2025 18:13:33 GMT
- Title: Leveraging LLMs to Assess Tutor Moves in Real-Life Dialogues: A Feasibility Study
- Authors: Danielle R. Thomas, Conrad Borchers, Jionghao Lin, Sanjit Kakarla, Shambhavi Bhushan, Erin Gatz, Shivang Gupta, Ralph Abboud, Kenneth R. Koedinger
- Abstract summary: We analyze 50 randomly selected transcripts of college-student remote tutors assisting middle school students in mathematics. Using GPT-4, GPT-4o, GPT-4-turbo, Gemini-1.5-pro, and LearnLM, we assess tutors' application of two tutor skills: delivering effective praise and responding to student math errors.
- Score: 3.976073625291173
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tutoring improves student achievement, but identifying and studying which tutoring actions are most associated with student learning at scale, based on audio transcriptions, is an open research problem. The present study investigates the feasibility and scalability of using generative AI to identify and evaluate specific tutor moves in real-life math tutoring. We analyze 50 randomly selected transcripts of college-student remote tutors assisting middle school students in mathematics. Using GPT-4, GPT-4o, GPT-4-turbo, Gemini-1.5-pro, and LearnLM, we assess tutors' application of two tutor skills: delivering effective praise and responding to student math errors. All models reliably detected relevant situations, for example, tutors providing praise to students (94-98% accuracy) and students making math errors (82-88% accuracy), and effectively evaluated the tutors' adherence to tutoring best practices, aligning closely with human judgments (83-89% and 73-77%, respectively). We propose a cost-effective prompting strategy and discuss practical implications for using large language models to support scalable assessment in authentic settings. This work further contributes LLM prompts to support reproducibility and research in AI-supported learning.
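The two-step assessment the abstract describes (first detect a relevant situation, then judge whether the tutor followed best practice) can be sketched as a simple prompt-and-parse routine. The prompt wording, labels, and helper names below are illustrative assumptions for one skill (effective praise), not the authors' released prompts:

```python
# Hedged sketch of an LLM-judge pipeline for one tutor skill:
# (1) detect whether a tutor turn contains praise at all,
# (2) if so, judge whether it is effort-based (best practice) or
#     outcome-based. All labels and wording are hypothetical.

def build_praise_prompt(tutor_turn: str) -> str:
    """Compose a two-step classification prompt for an LLM judge."""
    return (
        "You are evaluating a turn from a math tutoring transcript.\n"
        f'Tutor turn: "{tutor_turn}"\n'
        "Step 1: Does this turn contain praise? Answer PRAISE or NO_PRAISE.\n"
        "Step 2: If PRAISE, is it effort-based (praises the student's "
        "process) or outcome-based (praises correctness or ability)? "
        "Answer EFFORT or OUTCOME.\n"
        "Reply exactly as: <step1>,<step2-or-NA>"
    )

def parse_judgment(reply: str) -> dict:
    """Parse the model's 'LABEL,LABEL' reply into a small record."""
    detected, quality = (part.strip() for part in reply.split(",", 1))
    return {
        "praise": detected == "PRAISE",
        "effort_based": quality == "EFFORT",
    }
```

In the study's setting, each composed prompt would be sent to the models under comparison (GPT-4, GPT-4o, Gemini-1.5-pro, etc.) via their APIs, and the parsed labels would be compared against human annotations across transcripts to compute detection accuracy and agreement rates like those reported above.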
Related papers
- Exploring LLMs for Predicting Tutor Strategy and Student Outcomes in Dialogues [48.99818550820575]
Recent studies have shown that strategies used by tutors can have significant effects on student outcomes. Few works have studied predicting tutor strategy in dialogues. We investigate the ability of modern LLMs, particularly Llama 3 and GPT-4o, to predict both future tutor moves and student outcomes in dialogues.
arXiv Detail & Related papers (2025-07-09T14:47:35Z)
- Training LLM-based Tutors to Improve Student Learning Outcomes in Dialogues [46.60683274479208]
We introduce an approach to train large language models (LLMs) to generate tutor utterances that maximize the likelihood of student correctness. We show that tutor utterances generated by our model lead to significantly higher chances of correct student responses.
arXiv Detail & Related papers (2025-03-09T03:38:55Z)
- Do Tutors Learn from Equity Training and Can Generative AI Assess It? [2.116573423199236]
We evaluate tutor performance within an online lesson on enhancing tutors' skills when responding to students in potentially inequitable situations. We find marginally significant learning gains with increases in tutors' self-reported confidence in their knowledge. This work makes available a dataset of lesson log data, tutor responses, rubrics for human annotation, and generative AI prompts.
arXiv Detail & Related papers (2024-12-15T17:36:40Z)
- Towards the Pedagogical Steering of Large Language Models for Tutoring: A Case Study with Modeling Productive Failure [36.83786872708736]
One-to-one tutoring is one of the most efficient methods of teaching. We develop StratL, an algorithm that optimizes LLM prompts and steers the model to follow a predefined multi-turn tutoring plan represented as a transition graph. As a case study, we create a prototype tutor for high school math following Productive Failure (PF), an advanced and effective learning design.
arXiv Detail & Related papers (2024-10-03T16:15:41Z)
- Exploring Knowledge Tracing in Tutor-Student Dialogues using LLMs [49.18567856499736]
We investigate whether large language models (LLMs) can be supportive of open-ended dialogue tutoring. We apply a range of knowledge tracing (KT) methods on the resulting labeled data to track student knowledge levels over an entire dialogue. We conduct experiments on two tutoring dialogue datasets, and show that a novel yet simple LLM-based method, LLMKT, significantly outperforms existing KT methods in predicting student response correctness in dialogues.
arXiv Detail & Related papers (2024-09-24T22:31:39Z)
- GPT-4 as a Homework Tutor can Improve Student Engagement and Learning Outcomes [80.60912258178045]
We developed a prompting strategy that enables GPT-4 to conduct interactive homework sessions for high-school students learning English as a second language.
We carried out a Randomized Controlled Trial (RCT) in four high-school classes, replacing traditional homework with GPT-4 homework sessions for the treatment group.
We observed significant improvements in learning outcomes, most notably greater gains in grammar, as well as increased student engagement.
arXiv Detail & Related papers (2024-09-24T11:22:55Z)
- Evaluating and Optimizing Educational Content with Large Language Model Judgments [52.33701672559594]
We use Language Models (LMs) as educational experts to assess the impact of various instructions on learning outcomes.
We introduce an instruction optimization approach in which one LM generates instructional materials using the judgments of another LM as a reward function.
Human teachers' evaluations of these LM-generated worksheets show a significant alignment between the LM judgments and human teacher preferences.
arXiv Detail & Related papers (2024-03-05T09:09:15Z)
- Using Large Language Models to Assess Tutors' Performance in Reacting to Students Making Math Errors [2.099922236065961]
We investigate the capacity of generative AI to evaluate real-life tutors' performance in responding to students making math errors.
By analyzing 50 real-life tutoring dialogues, we find both GPT-3.5-Turbo and GPT-4 demonstrate proficiency in assessing the criteria related to reacting to students making errors.
GPT-4 tends to overidentify instances of students making errors, often treating student uncertainty as an error or inferring potential errors where human evaluators did not.
arXiv Detail & Related papers (2024-01-06T15:34:27Z)
- MathDial: A Dialogue Tutoring Dataset with Rich Pedagogical Properties Grounded in Math Reasoning Problems [74.73881579517055]
We propose a framework to generate such dialogues by pairing human teachers with a Large Language Model prompted to represent common student errors.
We describe how we use this framework to collect MathDial, a dataset of 3k one-to-one teacher-student tutoring dialogues.
arXiv Detail & Related papers (2023-05-23T21:44:56Z)
- Reinforcement Learning Tutor Better Supported Lower Performers in a Math Task [32.6507926764587]
Reinforcement learning could be a key tool to reduce the development cost and improve the effectiveness of intelligent tutoring software.
We show that deep reinforcement learning can be used to provide adaptive pedagogical support to students learning about the concept of volume.
arXiv Detail & Related papers (2023-04-11T02:11:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality or accuracy of the listed information and is not responsible for any consequences of its use.