Scaling Equitable Reflection Assessment in Education via Large Language Models and Role-Based Feedback Agents
- URL: http://arxiv.org/abs/2511.11772v1
- Date: Fri, 14 Nov 2025 09:46:21 GMT
- Title: Scaling Equitable Reflection Assessment in Education via Large Language Models and Role-Based Feedback Agents
- Authors: Chenyu Zhang, Xiaohang Luo
- Abstract summary: Formative feedback is one of the most effective drivers of student learning. In large or low-resource courses, instructors often lack the time, staffing, and bandwidth required to review and respond to every student reflection. This paper presents a theory-grounded system that uses five coordinated role-based LLM agents to score learner reflections.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Formative feedback is widely recognized as one of the most effective drivers of student learning, yet it remains difficult to implement equitably at scale. In large or low-resource courses, instructors often lack the time, staffing, and bandwidth required to review and respond to every student reflection, creating gaps in support precisely where learners would benefit most. This paper presents a theory-grounded system that uses five coordinated role-based LLM agents (Evaluator, Equity Monitor, Metacognitive Coach, Aggregator, and Reflexion Reviewer) to score learner reflections with a shared rubric and to generate short, bias-aware, learner-facing comments. The agents first produce structured rubric scores, then check for potentially biased or exclusionary language, add metacognitive prompts that invite students to think about their own thinking, and finally compose a concise feedback message of at most 120 words. The system includes simple fairness checks that compare scoring error across lower and higher scoring learners, enabling instructors to monitor and bound disparities in accuracy. We evaluate the pipeline in a 12-session AI literacy program with adult learners. In this setting, the system produces rubric scores that approach expert-level agreement, and trained graders rate the AI-generated comments as helpful, empathetic, and well aligned with instructional goals. Taken together, these results show that multi-agent LLM systems can deliver equitable, high-quality formative feedback at a scale and speed that would be impossible for human graders alone. More broadly, the work points toward a future where feedback-rich learning becomes feasible for any course size or context, advancing long-standing goals of equity, access, and instructional capacity in education.
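The fairness check described above (comparing scoring error across lower- and higher-scoring learners so instructors can bound disparities in accuracy) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the split threshold, function names, and flagging tolerance are all assumptions.

```python
# Hypothetical sketch of a fairness check on rubric scoring:
# compare mean absolute error (|AI score - expert score|) between
# lower- and higher-scoring learners and report the disparity.
# The split point (2.5) and tolerance are illustrative, not from the paper.

def mean_abs_error(pairs):
    """Mean absolute error between AI and expert rubric scores."""
    return sum(abs(ai - expert) for ai, expert in pairs) / len(pairs)

def fairness_gap(scores, split=2.5):
    """Split (ai_score, expert_score) pairs by expert score; return per-group MAE and the gap."""
    low = [p for p in scores if p[1] < split]    # lower-scoring learners
    high = [p for p in scores if p[1] >= split]  # higher-scoring learners
    mae_low = mean_abs_error(low)
    mae_high = mean_abs_error(high)
    return mae_low, mae_high, abs(mae_low - mae_high)

# Toy data: (AI score, expert score) on a 1-5 rubric.
scores = [(2.0, 1.5), (1.0, 2.0), (3.5, 3.0), (4.0, 4.0)]
mae_low, mae_high, gap = fairness_gap(scores)
tolerance = 0.3  # illustrative bound on acceptable disparity
flagged = gap > tolerance
```

An instructor-facing dashboard could surface `gap` per assignment and flag cohorts where the AI is systematically less accurate for struggling learners, which is the equity concern the Equity Monitor agent is meant to address.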
Related papers
- LLM-based Multimodal Feedback Produces Equivalent Learning and Better Student Perceptions than Educator Feedback [4.225232488376583]
This study introduces a real-time AI-facilitated multimodal feedback system that integrates structured textual explanations with dynamic multimedia resources. In an online crowdsourcing experiment, we compared this system against fixed business-as-usual feedback by educators across three dimensions. Results showed that AI multimodal feedback achieved learning gains equivalent to original educator feedback while significantly outperforming it on perceived clarity, specificity, conciseness, motivation, satisfaction, and reducing cognitive load.
arXiv Detail & Related papers (2026-01-21T18:58:08Z) - Measuring Teaching with LLMs [4.061135251278187]
This paper uses custom Large Language Models built on sentence-level embeddings. We show that these specialized models can achieve human-level and even super-human performance, with agreement with expert human ratings above 0.65. We also find that aggregate model scores align with teacher value-added measures, indicating they are capturing features relevant to student learning.
arXiv Detail & Related papers (2025-10-27T03:42:04Z) - Feedback Indicators: The Alignment between Llama and a Teacher in Language Learning [0.0]
It is essential first to extract relevant indicators, as these serve as the foundation upon which the feedback is constructed. This study examines the initial phase of extracting such indicators from students' submissions in a language learning course using the large language model Llama 3.1. The findings demonstrate statistically significant strong correlations, even in cases involving unanticipated combinations of indicators and criteria.
arXiv Detail & Related papers (2025-08-15T09:59:22Z) - J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning [54.85131761693927]
We introduce J1, a reinforcement learning framework for teaching LLM judges to think before making decisions. Our core contribution lies in converting all judgment tasks for non-verifiable and verifiable prompts into a unified format with verifiable rewards. We then use RL to train thinking-judges at scales of 8B, 32B, and 70B and show that they obtain state-of-the-art performance.
arXiv Detail & Related papers (2025-05-15T14:05:15Z) - Can Large Language Models Match Tutoring System Adaptivity? A Benchmarking Study [0.0]
Large Language Models (LLMs) hold promise as dynamic instructional aids. Yet, it remains unclear whether LLMs can replicate the adaptivity of intelligent tutoring systems (ITS).
arXiv Detail & Related papers (2025-04-07T23:57:32Z) - CoTAL: Human-in-the-Loop Prompt Engineering for Generalizable Formative Assessment Scoring [2.249916681499244]
Chain-of-Thought Prompting + Active Learning (CoTAL) is an Evidence-Centered Design (ECD)-based approach to formative assessment scoring. Our findings demonstrate that CoTAL improves GPT-4's scoring performance across domains.
arXiv Detail & Related papers (2025-04-03T06:53:34Z) - "My Grade is Wrong!": A Contestable AI Framework for Interactive Feedback in Evaluating Student Essays [6.810086342993699]
This paper introduces CAELF, a Contestable AI Empowered LLM Framework for automating interactive feedback.
CAELF allows students to query, challenge, and clarify their feedback by integrating a multi-agent system with computational argumentation.
A case study on 500 critical thinking essays with user studies demonstrates that CAELF significantly improves interactive feedback.
arXiv Detail & Related papers (2024-09-11T17:59:01Z) - Large Language Model as an Assignment Evaluator: Insights, Feedback, and Challenges in a 1000+ Student Course [49.296957552006226]
Using large language models (LLMs) for automatic evaluation has become an important evaluation method in NLP research.
This report shares how we use GPT-4 as an automatic assignment evaluator in a university course with 1,028 students.
arXiv Detail & Related papers (2024-07-07T00:17:24Z) - Learning to Generate Explainable Stock Predictions using Self-Reflective Large Language Models [54.21695754082441]
We propose a framework to teach Large Language Models (LLMs) to generate explainable stock predictions.
A reflective agent learns how to explain past stock movements through self-reasoning, while the PPO trainer trains the model to generate the most likely explanations.
Our framework can outperform both traditional deep-learning and LLM methods in prediction accuracy and Matthews correlation coefficient.
arXiv Detail & Related papers (2024-02-06T03:18:58Z) - Democratizing Reasoning Ability: Tailored Learning from Large Language Model [97.4921006089966]
We propose a tailored learning approach to distill such reasoning ability to smaller LMs.
We exploit the potential of LLM as a reasoning teacher by building an interactive multi-round learning paradigm.
To exploit the reasoning potential of the smaller LM, we propose self-reflection learning to motivate the student to learn from self-made mistakes.
arXiv Detail & Related papers (2023-10-20T07:50:10Z) - Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
We find that longer conversations more fully reveal a language model's proficiency in understanding questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z) - PapagAI: Automated Feedback for Reflective Essays [48.4434976446053]
We present the first open-source automated feedback tool based on didactic theory and implemented as a hybrid AI system.
The main objective of our work is to enable better learning outcomes for students and to complement the teaching activities of lecturers.
arXiv Detail & Related papers (2023-07-10T11:05:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.