Using Large Language Models to Assess Tutors' Performance in Reacting to
Students Making Math Errors
- URL: http://arxiv.org/abs/2401.03238v1
- Date: Sat, 6 Jan 2024 15:34:27 GMT
- Authors: Sanjit Kakarla, Danielle Thomas, Jionghao Lin, Shivang Gupta, Kenneth
R. Koedinger
- Abstract summary: We investigate the capacity of generative AI to evaluate real-life tutors' performance in responding to students making math errors.
By analyzing 50 real-life tutoring dialogues, we find both GPT-3.5-Turbo and GPT-4 demonstrate proficiency in assessing the criteria related to reacting to students making errors.
GPT-4 tends to overidentify instances of students making errors, often attributing student uncertainty or inferring potential errors where human evaluators did not.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Research suggests that tutors should adopt a strategic approach when
addressing math errors made by low-efficacy students. Rather than drawing
direct attention to the error, tutors should guide the students to identify and
correct their mistakes on their own. While tutor lessons have introduced this
pedagogical skill, human evaluation of tutors applying this strategy is arduous
and time-consuming. Large language models (LLMs) show promise in providing
real-time assessment to tutors during their actual tutoring sessions, yet
little is known regarding their accuracy in this context. In this study, we
investigate the capacity of generative AI to evaluate real-life tutors'
performance in responding to students making math errors. By analyzing 50
real-life tutoring dialogues, we find both GPT-3.5-Turbo and GPT-4 demonstrate
proficiency in assessing the criteria related to reacting to students making
errors. However, both models exhibit limitations in recognizing instances where
the student made an error. Notably, GPT-4 tends to overidentify instances of
students making errors, often attributing student uncertainty or inferring
potential errors where human evaluators did not. Future work will focus on
enhancing generalizability by assessing a larger dataset of dialogues and
evaluating learning transfer. Specifically, we will analyze the performance of
tutors in real-life scenarios when responding to students' math errors before
and after lesson completion on this crucial tutoring skill.
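As a concrete illustration of the assessment setup the abstract describes, here is a minimal sketch of prompting an OpenAI chat model to grade one tutoring exchange. The two-part rubric, JSON output shape, example dialogue, and model choice are illustrative assumptions, not the authors' actual evaluation protocol.

```python
# A hedged sketch (not the paper's exact protocol) of LLM-based grading of a
# tutor's reaction to a student math error, using the OpenAI Python client.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical rubric; the paper's real criteria wording is not reproduced here.
RUBRIC = (
    "You will read a short tutoring exchange.\n"
    "1. Did the student make a math error? (yes/no)\n"
    "2. If yes, did the tutor guide the student to find and fix the error "
    "themselves rather than pointing it out directly? (yes/no)\n"
    'Reply as JSON: {"student_error": "...", "indirect_guidance": "..."}'
)

def assess_dialogue(dialogue: str, model: str = "gpt-4") -> str:
    """Ask the model to apply the rubric to one tutoring dialogue."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic grading
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": dialogue},
        ],
    )
    return response.choices[0].message.content

example = (
    "Student: 3/4 + 1/4 = 4/8\n"
    "Tutor: Interesting! What happens to the denominator when the two "
    "fractions already share one?"
)
print(assess_dialogue(example))
```

Comparing such model outputs against human evaluators' labels on each dialogue is what surfaces the over-identification pattern reported above.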
Related papers
- Stepwise Verification and Remediation of Student Reasoning Errors with Large Language Model Tutors [78.53699244846285]
Large language models (LLMs) present an opportunity to scale high-quality personalized education to all.
However, LLMs struggle to precisely detect students' errors and tailor their feedback to those errors.
Inspired by real-world teaching practice where teachers identify student errors and customize their response based on them, we focus on verifying student solutions.
arXiv Detail & Related papers (2024-07-12T10:11:40Z)
- Evaluating and Optimizing Educational Content with Large Language Model Judgments [52.33701672559594]
We use Language Models (LMs) as educational experts to assess the impact of various instructions on learning outcomes.
We introduce an instruction optimization approach in which one LM generates instructional materials using the judgments of another LM as a reward function (a simple best-of-n sketch of this judge-as-reward idea appears after this list).
Evaluations of these LM-generated worksheets by human teachers show significant alignment between the LM judgments and teacher preferences.
arXiv Detail & Related papers (2024-03-05T09:09:15Z)
- Improving Assessment of Tutoring Practices using Retrieval-Augmented Generation [10.419430731115405]
One-on-one tutoring is an effective instructional method for enhancing learning, yet its efficacy hinges on tutor competencies.
This study aims to harness Generative Pre-trained Transformers (GPT), such as the GPT-3.5 and GPT-4 models, to automatically assess tutors' ability to use social-emotional tutoring strategies.
arXiv Detail & Related papers (2024-02-04T20:42:30Z)
- Comparative Analysis of GPT-4 and Human Graders in Evaluating Praise Given to Students in Synthetic Dialogues [2.3361634876233817]
Large language models, such as the AI chatbot ChatGPT, hold potential for offering constructive feedback to tutors in practical settings.
The accuracy of AI-generated feedback remains uncertain, with scant research investigating the ability of models like ChatGPT to deliver effective feedback.
arXiv Detail & Related papers (2023-07-05T04:14:01Z)
- Can Language Models Teach Weaker Agents? Teacher Explanations Improve Students via Personalization [84.86241161706911]
We show that teacher LLMs can indeed intervene on student reasoning to improve their performance.
We also demonstrate that in multi-turn interactions, teacher explanations generalize: students learn from the explained data and improve on future, unexplained data.
We verify that misaligned teachers can lower student performance to random chance by intentionally misleading them.
arXiv Detail & Related papers (2023-06-15T17:27:20Z)
- MathDial: A Dialogue Tutoring Dataset with Rich Pedagogical Properties Grounded in Math Reasoning Problems [74.73881579517055]
We propose a framework to generate such dialogues by pairing human teachers with a Large Language Model prompted to represent common student errors.
We describe how we use this framework to collect MathDial, a dataset of 3k one-to-one teacher-student tutoring dialogues.
arXiv Detail & Related papers (2023-05-23T21:44:56Z)
- Opportunities and Challenges in Neural Dialog Tutoring [54.07241332881601]
We rigorously analyze various generative language models on two dialog tutoring datasets for language learning.
We find that although current approaches can model tutoring in constrained learning scenarios, they perform poorly in less constrained scenarios.
Our human quality evaluation shows that both models and ground-truth annotations exhibit low performance in terms of equitable tutoring.
arXiv Detail & Related papers (2023-01-24T11:00:17Z)
- Distantly-Supervised Named Entity Recognition with Adaptive Teacher Learning and Fine-grained Student Ensemble [56.705249154629264]
Self-training teacher-student frameworks are proposed to improve the robustness of NER models.
In this paper, we propose an adaptive teacher learning method comprising two teacher-student networks.
A fine-grained student ensemble updates each fragment of the teacher model with a temporal moving average of the corresponding student fragment, which promotes consistent predictions on each fragment under noise (see the EMA sketch after this list).
arXiv Detail & Related papers (2022-12-13T12:14:09Z)
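One simple way to realize the judge-as-reward idea from the Evaluating and Optimizing Educational Content entry is best-of-n selection: sample several candidate materials from a generator LM and keep the one a judge LM scores highest. The sketch below is an illustrative simplification, not that paper's actual optimization procedure; model names and prompts are assumptions.

```python
# Best-of-n sketch of using one LM's judgment as a reward for another LM's
# generated instructional materials (illustrative only; models/prompts assumed).
from openai import OpenAI

client = OpenAI()

def generate_candidates(topic: str, n: int = 4) -> list[str]:
    """Sample n diverse candidate worksheets from a generator model."""
    out = client.chat.completions.create(
        model="gpt-3.5-turbo",
        n=n,
        temperature=0.9,  # higher temperature for varied candidates
        messages=[{"role": "user",
                   "content": f"Write a short practice worksheet on {topic}."}],
    )
    return [choice.message.content for choice in out.choices]

def judge_score(worksheet: str) -> float:
    """Ask a judge model for a 1-10 rating of likely learning value."""
    out = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user",
                   "content": "Rate this worksheet's likely learning value "
                              "from 1 to 10. Reply with a number only.\n\n"
                              + worksheet}],
    )
    return float(out.choices[0].message.content.strip())

best = max(generate_candidates("adding fractions"), key=judge_score)
print(best)
```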
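The temporal moving average in the Distantly-Supervised NER entry is an exponential moving average (EMA) of student weights into the teacher. Below is a minimal PyTorch sketch; it simplifies that paper's per-fragment updates to per-parameter updates, and the decay value is an assumption.

```python
# Minimal EMA teacher update (assumes PyTorch). The paper works per model
# fragment; here every parameter is its own fragment, and decay=0.999 is an
# illustrative choice, not the paper's setting.
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module,
               student: torch.nn.Module,
               decay: float = 0.999) -> None:
    """teacher <- decay * teacher + (1 - decay) * student, parameter-wise."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(decay).add_(s_p, alpha=1.0 - decay)
```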