Related papers: Toward Automated Qualitative Analysis: Leveraging Large Language Models for Tutoring Dialogue Evaluation

Related papers

PATS: Personality-Aware Teaching Strategies with Large Language Model Tutors [66.56586559631516]
Large language models (LLMs) have potential as educational tutors.<n>But different tutoring strategies benefit different student personalities.<n>Despite this, current LLM tutoring systems do not take into account student personality traits.
arXiv Detail & Related papers (2026-01-13T10:17:26Z)
TutorBench: A Benchmark To Assess Tutoring Capabilities Of Large Language Models [10.963195858672627]
TutorBench is a dataset and evaluation benchmark designed to rigorously evaluate the core tutoring skills of large language models (LLMs)<n>Samples are drawn from three common tutoring tasks: (i) generating adaptive explanations tailored to a student's confusion, (ii) providing actionable feedback on a student's work, and (iii) promoting active learning through effective hint generation.<n>We evaluate 16 frontier LLMs on TutorBench and present a detailed analysis of their performance and behavior.
arXiv Detail & Related papers (2025-10-03T01:41:09Z)
LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model [99.71684530652942]
We show that LLaVA-Critic-R1 emerges as a top-performing critic but also as a competitive policy model.<n>Applying self-critique at test time yields an average +13.8% improvement on five representative reasoning tasks.<n>Our results reveal that RL training on critic data can produce a unified model excelling at both evaluation and generation.
arXiv Detail & Related papers (2025-08-31T03:08:02Z)
Exploring LLMs for Predicting Tutor Strategy and Student Outcomes in Dialogues [48.99818550820575]
Recent studies have shown that strategies used by tutors can have significant effects on student outcomes.<n>Few works have studied predicting tutor strategy in dialogues.<n>We investigate the ability of modern LLMs, particularly Llama 3 and GPT-4o, to predict both future tutor moves and student outcomes in dialogues.
arXiv Detail & Related papers (2025-07-09T14:47:35Z)
Leveraging LLMs to Assess Tutor Moves in Real-Life Dialogues: A Feasibility Study [3.976073625291173]
We analyze 50 randomly selected transcripts of college-student remote tutors assisting middle school students in mathematics.<n>Using GPT-4, GPT-4o, GPT-4-turbo, Gemini-1.5-pro, and LearnLM, we assess tutors' application of two tutor skills: delivering effective praise and responding to student math errors.
arXiv Detail & Related papers (2025-06-20T18:13:33Z)
Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback [59.078756231841574]
Critique-GRPO is an online RL framework that integrates both natural language and numerical feedback for effective policy optimization.<n>We show Critique-GRPO consistently outperforms supervised learning and RL-based fine-tuning methods across eight challenging mathematical, STEM, and general reasoning tasks.
arXiv Detail & Related papers (2025-06-03T17:39:02Z)
From Problem-Solving to Teaching Problem-Solving: Aligning LLMs with Pedagogy using Reinforcement Learning [76.09281171131941]
Large language models (LLMs) can transform education, but their optimization for direct question-answering often undermines effective pedagogy.<n>We propose an online reinforcement learning (RL)-based alignment framework that can quickly adapt LLMs into effective tutors.
arXiv Detail & Related papers (2025-05-21T15:00:07Z)
J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning [54.85131761693927]
We introduce J1, a reinforcement learning framework for teaching LLM judges to think before making decisions.<n>Our core contribution lies in converting all judgment tasks for non-verifiable and verifiable prompts into a unified format with verifiable rewards.<n>We then use RL to train thinking-judges at scales of 8B, 32B, and 70B and show that they obtain state-of-the-art performance.
arXiv Detail & Related papers (2025-05-15T14:05:15Z)
Single-Agent vs. Multi-Agent LLM Strategies for Automated Student Reflection Assessment [16.145339327301816]
Large Language Models (LLMs) can transform student reflections into quantitative scores.<n>LLMs can effectively automate reflection assessment, reduce educators' workload, and enable timely support for students.
arXiv Detail & Related papers (2025-04-08T06:34:15Z)
Can Large Language Models Match Tutoring System Adaptivity? A Benchmarking Study [0.0]
Large Language Models (LLMs) hold promise as dynamic instructional aids.<n>Yet, it remains unclear whether LLMs can replicate the adaptivity of intelligent tutoring systems (ITS)
arXiv Detail & Related papers (2025-04-07T23:57:32Z)
Training LLM-based Tutors to Improve Student Learning Outcomes in Dialogues [46.60683274479208]
We introduce an approach to train large language models (LLMs) to generate tutor utterances that maximize the likelihood of student correctness.<n>We show that tutor utterances generated by our model lead to significantly higher chances of correct student responses.
arXiv Detail & Related papers (2025-03-09T03:38:55Z)
Automated Feedback in Math Education: A Comparative Analysis of LLMs for Open-Ended Responses [0.0]
This study aims to explore the potential of Large Language Models (LLMs) in facilitating automated feedback in math education. We employ Mistral, a version of Llama catered to math, and fine-tune this model for evaluating student responses by leveraging a dataset of student responses and teacher-written feedback for middle-school math problems. We evaluate the model's performance in scoring accuracy and the quality of feedback by utilizing judgments from 2 teachers.
arXiv Detail & Related papers (2024-10-29T16:57:45Z)
An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation [29.81362106367831]
Existing evaluation methods often suffer from high costs, limited test formats, the need of human references, and systematic evaluation biases. In contrast to previous studies that rely on human annotations, Auto-PRE selects evaluators automatically based on their inherent traits. Experimental results indicate our Auto-PRE achieves state-of-the-art performance at a lower cost.
arXiv Detail & Related papers (2024-10-16T06:06:06Z)
Evaluating Fine-Tuning Efficiency of Human-Inspired Learning Strategies in Medical Question Answering [1.912429179274357]
This study evaluates the fine-tuning efficiency of five human-inspired strategies across four language models, three datasets, and both human- and LLM-labelled data. strategies achieve the best accuracy gain of 1.81% and an average gain of 1.02% across datasets.
arXiv Detail & Related papers (2024-08-15T02:22:48Z)
Evaluating and Optimizing Educational Content with Large Language Model Judgments [52.33701672559594]
We use Language Models (LMs) as educational experts to assess the impact of various instructions on learning outcomes. We introduce an instruction optimization approach in which one LM generates instructional materials using the judgments of another LM as a reward function. Human teachers' evaluations of these LM-generated worksheets show a significant alignment between the LM judgments and human teacher preferences.
arXiv Detail & Related papers (2024-03-05T09:09:15Z)
Improving the Validity of Automatically Generated Feedback via Reinforcement Learning [46.667783153759636]
We propose a framework for feedback generation that optimize both correctness and alignment using reinforcement learning (RL)<n>Specifically, we use GPT-4's annotations to create preferences over feedback pairs in an augmented dataset for training via direct preference optimization (DPO)
arXiv Detail & Related papers (2024-03-02T20:25:50Z)
Automating Dataset Updates Towards Reliable and Timely Evaluation of Large Language Models [81.27391252152199]
Large language models (LLMs) have achieved impressive performance across various natural language benchmarks. We propose to automate dataset updating and provide systematic analysis regarding its effectiveness. There are two updating strategies: 1) mimicking strategy to generate similar samples based on original data, and 2) extending strategy that further expands existing samples.
arXiv Detail & Related papers (2024-02-19T07:15:59Z)
MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation [60.65820977963331]
We introduce a novel evaluation paradigm for Large Language Models (LLMs) This paradigm shifts the emphasis from result-oriented assessments, which often neglect the reasoning process, to a more comprehensive evaluation. By applying this paradigm in the GSM8K dataset, we have developed the MR-GSM8K benchmark.
arXiv Detail & Related papers (2023-12-28T15:49:43Z)
StrategyLLM: Large Language Models as Strategy Generators, Executors, Optimizers, and Evaluators for Problem Solving [76.5322280307861]
StrategyLLM allows LLMs to perform inductive reasoning, deriving general strategies from specific task instances, and deductive reasoning, applying these general strategies to particular task examples, for constructing generalizable and consistent few-shot prompts. Experimental results demonstrate that StrategyLLM outperforms the competitive baseline CoT-SC that requires human-annotated solutions on 13 datasets across 4 challenging tasks without human involvement, including math reasoning (34.2% $rightarrow$ 38.8%), commonsense reasoning (70.3% $rightarrow$ 72.5%), algorithmic reasoning (73.7% $rightarrow$ 85.0
arXiv Detail & Related papers (2023-11-15T09:18:09Z)
Opportunities and Challenges in Neural Dialog Tutoring [54.07241332881601]
We rigorously analyze various generative language models on two dialog tutoring datasets for language learning. We find that although current approaches can model tutoring in constrained learning scenarios, they perform poorly in less constrained scenarios. Our human quality evaluation shows that both models and ground-truth annotations exhibit low performance in terms of equitable tutoring.
arXiv Detail & Related papers (2023-01-24T11:00:17Z)

This list is automatically generated from the titles and abstracts of the papers in this site.