Toward Automated Qualitative Analysis: Leveraging Large Language Models for Tutoring Dialogue Evaluation
- URL: http://arxiv.org/abs/2504.13882v1
- Date: Thu, 03 Apr 2025 15:08:03 GMT
- Title: Toward Automated Qualitative Analysis: Leveraging Large Language Models for Tutoring Dialogue Evaluation
- Authors: Megan Gu, Chloe Qianhui Zhao, Claire Liu, Nikhil Patel, Jahnvi Shah, Jionghao Lin, Kenneth R. Koedinger,
- Abstract summary: This study introduces an automated system leveraging large language models (LLMs) to assess the effectiveness of five key tutoring strategies.<n>Using a public dataset from the Teacher-Student Chatroom, our system classifies each tutoring strategy as either being employed as desired or undesired.
- Score: 1.7590081165362783
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Our study introduces an automated system leveraging large language models (LLMs) to assess the effectiveness of five key tutoring strategies: 1. giving effective praise, 2. reacting to errors, 3. determining what students know, 4. helping students manage inequity, and 5. responding to negative self-talk. Using a public dataset from the Teacher-Student Chatroom Corpus, our system classifies each tutoring strategy as either being employed as desired or undesired. Our study utilizes GPT-3.5 with few-shot prompting to assess the use of these strategies and analyze tutoring dialogues. The results show that for the five tutoring strategies, True Negative Rates (TNR) range from 0.655 to 0.738, and Recall ranges from 0.327 to 0.432, indicating that the model is effective at excluding incorrect classifications but struggles to consistently identify the correct strategy. The strategy \textit{helping students manage inequity} showed the highest performance with a TNR of 0.738 and Recall of 0.432. The study highlights the potential of LLMs in tutoring strategy analysis and outlines directions for future improvements, including incorporating more advanced models for more nuanced feedback.
Related papers
- Exploring LLMs for Predicting Tutor Strategy and Student Outcomes in Dialogues [48.99818550820575]
Recent studies have shown that strategies used by tutors can have significant effects on student outcomes.<n>Few works have studied predicting tutor strategy in dialogues.<n>We investigate the ability of modern LLMs, particularly Llama 3 and GPT-4o, to predict both future tutor moves and student outcomes in dialogues.
arXiv Detail & Related papers (2025-07-09T14:47:35Z) - Leveraging LLMs to Assess Tutor Moves in Real-Life Dialogues: A Feasibility Study [3.976073625291173]
We analyze 50 randomly selected transcripts of college-student remote tutors assisting middle school students in mathematics.<n>Using GPT-4, GPT-4o, GPT-4-turbo, Gemini-1.5-pro, and LearnLM, we assess tutors' application of two tutor skills: delivering effective praise and responding to student math errors.
arXiv Detail & Related papers (2025-06-20T18:13:33Z) - Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback [59.078756231841574]
Critique-GRPO is an online RL framework that integrates both natural language and numerical feedback for effective policy optimization.<n>We show Critique-GRPO consistently outperforms supervised learning and RL-based fine-tuning methods across eight challenging mathematical, STEM, and general reasoning tasks.
arXiv Detail & Related papers (2025-06-03T17:39:02Z) - From Problem-Solving to Teaching Problem-Solving: Aligning LLMs with Pedagogy using Reinforcement Learning [76.09281171131941]
Large language models (LLMs) can transform education, but their optimization for direct question-answering often undermines effective pedagogy.<n>We propose an online reinforcement learning (RL)-based alignment framework that can quickly adapt LLMs into effective tutors.
arXiv Detail & Related papers (2025-05-21T15:00:07Z) - Single-Agent vs. Multi-Agent LLM Strategies for Automated Student Reflection Assessment [16.145339327301816]
Large Language Models (LLMs) can transform student reflections into quantitative scores.<n>LLMs can effectively automate reflection assessment, reduce educators' workload, and enable timely support for students.
arXiv Detail & Related papers (2025-04-08T06:34:15Z) - Can Large Language Models Match Tutoring System Adaptivity? A Benchmarking Study [0.0]
Large Language Models (LLMs) hold promise as dynamic instructional aids.<n>Yet, it remains unclear whether LLMs can replicate the adaptivity of intelligent tutoring systems (ITS)
arXiv Detail & Related papers (2025-04-07T23:57:32Z) - Training LLM-based Tutors to Improve Student Learning Outcomes in Dialogues [46.60683274479208]
We introduce an approach to train large language models (LLMs) to generate tutor utterances that maximize the likelihood of student correctness.<n>We show that tutor utterances generated by our model lead to significantly higher chances of correct student responses.
arXiv Detail & Related papers (2025-03-09T03:38:55Z) - Automated Feedback in Math Education: A Comparative Analysis of LLMs for Open-Ended Responses [0.0]
This study aims to explore the potential of Large Language Models (LLMs) in facilitating automated feedback in math education.
We employ Mistral, a version of Llama catered to math, and fine-tune this model for evaluating student responses by leveraging a dataset of student responses and teacher-written feedback for middle-school math problems.
We evaluate the model's performance in scoring accuracy and the quality of feedback by utilizing judgments from 2 teachers.
arXiv Detail & Related papers (2024-10-29T16:57:45Z) - An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation [29.81362106367831]
Existing evaluation methods often suffer from high costs, limited test formats, the need of human references, and systematic evaluation biases.
In contrast to previous studies that rely on human annotations, Auto-PRE selects evaluators automatically based on their inherent traits.
Experimental results indicate our Auto-PRE achieves state-of-the-art performance at a lower cost.
arXiv Detail & Related papers (2024-10-16T06:06:06Z) - Evaluating Fine-Tuning Efficiency of Human-Inspired Learning Strategies in Medical Question Answering [1.912429179274357]
This study evaluates the fine-tuning efficiency of five human-inspired strategies across four language models, three datasets, and both human- and LLM-labelled data.
strategies achieve the best accuracy gain of 1.81% and an average gain of 1.02% across datasets.
arXiv Detail & Related papers (2024-08-15T02:22:48Z) - Evaluating and Optimizing Educational Content with Large Language Model Judgments [52.33701672559594]
We use Language Models (LMs) as educational experts to assess the impact of various instructions on learning outcomes.
We introduce an instruction optimization approach in which one LM generates instructional materials using the judgments of another LM as a reward function.
Human teachers' evaluations of these LM-generated worksheets show a significant alignment between the LM judgments and human teacher preferences.
arXiv Detail & Related papers (2024-03-05T09:09:15Z) - Improving the Validity of Automatically Generated Feedback via Reinforcement Learning [46.667783153759636]
We propose a framework for feedback generation that optimize both correctness and alignment using reinforcement learning (RL)<n>Specifically, we use GPT-4's annotations to create preferences over feedback pairs in an augmented dataset for training via direct preference optimization (DPO)
arXiv Detail & Related papers (2024-03-02T20:25:50Z) - Automating Dataset Updates Towards Reliable and Timely Evaluation of Large Language Models [81.27391252152199]
Large language models (LLMs) have achieved impressive performance across various natural language benchmarks.
We propose to automate dataset updating and provide systematic analysis regarding its effectiveness.
There are two updating strategies: 1) mimicking strategy to generate similar samples based on original data, and 2) extending strategy that further expands existing samples.
arXiv Detail & Related papers (2024-02-19T07:15:59Z) - MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation [60.65820977963331]
We introduce a novel evaluation paradigm for Large Language Models (LLMs)
This paradigm shifts the emphasis from result-oriented assessments, which often neglect the reasoning process, to a more comprehensive evaluation.
By applying this paradigm in the GSM8K dataset, we have developed the MR-GSM8K benchmark.
arXiv Detail & Related papers (2023-12-28T15:49:43Z) - StrategyLLM: Large Language Models as Strategy Generators, Executors, Optimizers, and Evaluators for Problem Solving [76.5322280307861]
StrategyLLM allows LLMs to perform inductive reasoning, deriving general strategies from specific task instances, and deductive reasoning, applying these general strategies to particular task examples, for constructing generalizable and consistent few-shot prompts.
Experimental results demonstrate that StrategyLLM outperforms the competitive baseline CoT-SC that requires human-annotated solutions on 13 datasets across 4 challenging tasks without human involvement, including math reasoning (34.2% $rightarrow$ 38.8%), commonsense reasoning (70.3% $rightarrow$ 72.5%), algorithmic reasoning (73.7% $rightarrow$ 85.0
arXiv Detail & Related papers (2023-11-15T09:18:09Z) - Opportunities and Challenges in Neural Dialog Tutoring [54.07241332881601]
We rigorously analyze various generative language models on two dialog tutoring datasets for language learning.
We find that although current approaches can model tutoring in constrained learning scenarios, they perform poorly in less constrained scenarios.
Our human quality evaluation shows that both models and ground-truth annotations exhibit low performance in terms of equitable tutoring.
arXiv Detail & Related papers (2023-01-24T11:00:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.