Related papers: UCO: A Multi-Turn Interactive Reinforcement Learning Method for Adaptive Teaching with Large Language Models

UCO: A Multi-Turn Interactive Reinforcement Learning Method for Adaptive Teaching with Large Language Models

URL: http://arxiv.org/abs/2511.08873v1
Date: Thu, 13 Nov 2025 01:13:29 GMT
Title: UCO: A Multi-Turn Interactive Reinforcement Learning Method for Adaptive Teaching with Large Language Models
Authors: Shouang Wei, Min Zhang, Xin Lin, Bo Jiang, Kun Kuang, Zhongxiang Dai,
Abstract summary: Large language models (LLMs) are shifting from answer providers to intelligent tutors in educational settings.<n>Recent reinforcement learning approaches address this limitation but face two critical challenges.<n>We propose the Unidirectional Cognitive Optimization (UCO) method to address these challenges.
Score: 59.693733170193944
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) are shifting from answer providers to intelligent tutors in educational settings, yet current supervised fine-tuning methods only learn surface teaching patterns without dynamic adaptation capabilities. Recent reinforcement learning approaches address this limitation but face two critical challenges. First, they evaluate teaching effectiveness solely based on whether students produce correct outputs, unable to distinguish whether students genuinely understand or echo teacher-provided answers during interaction. Second, they cannot perceive students' evolving cognitive states in real time through interactive dialogue, thus failing to adapt teaching strategies to match students' cognitive levels dynamically. We propose the Unidirectional Cognitive Optimization (UCO) method to address these challenges. UCO uses a multi-turn interactive reinforcement learning paradigm where the innovation lies in two synergistic reward functions: the Progress Reward captures students' cognitive advancement, evaluating whether students truly transition from confusion to comprehension, while the Scaffold Reward dynamically identifies each student's Zone of Proximal Development (ZPD), encouraging teachers to maintain productive teaching within this zone. We evaluate UCO by comparing it against 11 baseline models on BigMath and MathTutorBench benchmarks. Experimental results demonstrate that our UCO model outperforms all models of equivalent scale and achieves performance comparable to advanced closed-source models. The code and data are available at https://github.com/Mind-Lab-ECNU/UCO.

Related papers

Investigating Pedagogical Teacher and Student LLM Agents: Genetic Adaptation Meets Retrieval Augmented Generation Across Learning Style [16.985943868964394]
Effective teaching requires adapting instructional strategies to accommodate the diverse cognitive and behavioral profiles of students.<n>This paper introduces a novel simulation framework that integrates heterogeneous student agents with a self-optimizing teacher agent.<n>Our results highlight the potential of LLM-driven simulations to inform adaptive teaching practices and provide a testbed for training human educators in data-driven environments.
arXiv Detail & Related papers (2025-05-25T14:45:35Z)
Improving Question Embeddings with Cognitive Representation Optimization for Knowledge Tracing [77.14348157016518]
Research on KT modeling focuses on predicting future student performance based on existing, unupdated records of student learning interactions.<n>We propose a knowledge-tracking cognitive representation optimization (CRO-KT) model that uses dynamic programming algorithms to optimize the structure of cognitive representation.
arXiv Detail & Related papers (2025-04-05T09:32:03Z)
Toward In-Context Teaching: Adapting Examples to Students' Misconceptions [54.82965010592045]
We introduce a suite of models and evaluation methods we call AdapT. AToM is a new probabilistic model for adaptive teaching that jointly infers students' past beliefs and optimize for the correctness of future beliefs. Our results highlight both the difficulty of the adaptive teaching task and the potential of learned adaptive models for solving it.
arXiv Detail & Related papers (2024-05-07T17:05:27Z)
Learn to Teach: Sample-Efficient Privileged Learning for Humanoid Locomotion over Diverse Terrains [6.967583364984562]
This work proposes a novel one-stage training framework-Learn to Teach (L2T)-which unifies teacher and student policy learning.<n>Our approach recycles simulator samples and synchronizes the learning trajectories through shared dynamics, significantly reducing sample complexities and training time.<n>We validate the RL variant (L2T-RL) through extensive simulations and hardware tests on the Digit robot, demonstrating zero-shot sim-to-real transfer and robust performance over 12+ challenging terrains without depth estimation modules.
arXiv Detail & Related papers (2024-02-09T21:16:43Z)
YODA: Teacher-Student Progressive Learning for Language Models [82.0172215948963]
This paper introduces YODA, a teacher-student progressive learning framework. It emulates the teacher-student education process to improve the efficacy of model fine-tuning. Experiments show that training LLaMA2 with data from YODA improves SFT with significant performance gain.
arXiv Detail & Related papers (2024-01-28T14:32:15Z)
RLIF: Interactive Imitation Learning as Reinforcement Learning [56.997263135104504]
We show how off-policy reinforcement learning can enable improved performance under assumptions that are similar but potentially even more practical than those of interactive imitation learning. Our proposed method uses reinforcement learning with user intervention signals themselves as rewards. This relaxes the assumption that intervening experts in interactive imitation learning should be near-optimal and enables the algorithm to learn behaviors that improve over the potential suboptimal human expert.
arXiv Detail & Related papers (2023-11-21T21:05:21Z)
Distantly-Supervised Named Entity Recognition with Adaptive Teacher Learning and Fine-grained Student Ensemble [56.705249154629264]
Self-training teacher-student frameworks are proposed to improve the robustness of NER models. In this paper, we propose an adaptive teacher learning comprised of two teacher-student networks. Fine-grained student ensemble updates each fragment of the teacher model with a temporal moving average of the corresponding fragment of the student, which enhances consistent predictions on each model fragment against noise.
arXiv Detail & Related papers (2022-12-13T12:14:09Z)
Concurrent Training Improves the Performance of Behavioral Cloning from Observation [10.939683083130616]
Learning from demonstration is widely used as an efficient way for robots to acquire new skills. Learning from observation offers a way to utilize unlabeled demonstrations (e.g., video) to perform imitation learning. One approach to this is behavioral cloning from observation (BCO) We present a novel theoretical analysis of BCO, introduce a modification BCO*, and show that in the semi-supervised setting, BCO* can concurrently improve both its estimate for the inverse dynamics model and the expert policy.
arXiv Detail & Related papers (2020-08-03T21:30:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.