SEFL: Enhancing Educational Assignment Feedback with LLM Agents
- URL: http://arxiv.org/abs/2502.12927v2
- Date: Fri, 01 Aug 2025 13:19:46 GMT
- Title: SEFL: Enhancing Educational Assignment Feedback with LLM Agents
- Authors: Mike Zhang, Amalie Pernille Dilling, Léon Gondelman, Niels Erik Ruan Lyngdorf, Euan D. Lindsay, Johannes Bjerva
- Abstract summary: Synthetic Educational Feedback Loops (SEFL) is a synthetic data framework designed to generate data that resembles immediate, on-demand feedback at scale. To obtain this data, two large language models (LLMs) operate in teacher-student roles to simulate assignment completion and formative feedback. We show that SEFL-tuned models outperform both their non-tuned counterparts and an existing baseline in feedback quality.
- Score: 5.191286314473505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Providing high-quality feedback on student assignments is crucial for student success, but it is constrained by time and costs. In this work, we introduce Synthetic Educational Feedback Loops (SEFL), a synthetic data framework designed to generate data that resembles immediate, on-demand feedback at scale, without relying on extensive real-world student assignments. To obtain this data, two large language models (LLMs) operate in teacher-student roles to simulate assignment completion and formative feedback, generating synthetic pairs of student work with corresponding critiques and actionable improvements from a teacher. With this data, we fine-tune smaller, more computationally efficient LLMs on these synthetic pairs, enabling them to replicate key features of high-quality, goal-oriented feedback. Unlike personalized tutoring approaches that offer multi-turn, individualized instruction, SEFL specifically focuses on replicating the teacher-student assignment feedback loop in higher education. Through comprehensive evaluations with four LLM judges and three human experts, we demonstrate that SEFL-tuned models outperform both their non-tuned counterparts and an existing baseline in feedback quality. The potential for societal impact is reinforced by extensive qualitative comments and ratings from human stakeholders -- both students and higher-education instructors. All in all, SEFL has substantial potential to transform feedback processes for higher education and beyond.
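As a concrete illustration of the loop the abstract describes, the sketch below pairs a "student" agent with a "teacher" agent to synthesize one training example. It assumes an OpenAI-compatible chat API; the client setup, model name, and prompts are illustrative stand-ins, not the paper's actual configuration.

```python
# Minimal sketch of a SEFL-style two-agent loop (illustrative; the paper's
# actual prompts, models, and filtering steps are not specified here).
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint; hypothetical setup

def chat(system: str, user: str, model: str = "gpt-4o-mini") -> str:
    """One chat-completion call; the `model` default is a placeholder choice."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

def synthesize_pair(assignment: str) -> dict:
    # "Student" agent completes the assignment, deliberately imperfectly,
    # so there is something substantive for the teacher to critique.
    student_work = chat(
        "You are a university student. Complete the assignment; include a "
        "few realistic mistakes or omissions.",
        assignment,
    )
    # "Teacher" agent writes formative feedback: critiques plus actionable fixes.
    feedback = chat(
        "You are a university teacher. Give formative feedback on the "
        "student's submission: concrete critiques and actionable improvements.",
        f"Assignment:\n{assignment}\n\nSubmission:\n{student_work}",
    )
    return {"assignment": assignment, "submission": student_work,
            "feedback": feedback}

# Pairs like these would then serve as supervised fine-tuning data for a
# smaller, more efficient feedback model.
```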
Related papers
- DistiLLM-2: A Contrastive Approach Boosts the Distillation of LLMs [58.4911494598431]
DistiLLM-2 is a contrastive approach that simultaneously increases the likelihood of teacher responses and decreases that of student responses. Our experiments show that DistiLLM-2 not only builds high-performing student models across a wide range of tasks, but also supports diverse applications.
arXiv Detail & Related papers (2025-03-10T08:51:32Z)
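Read literally, the summary's objective raises the student model's likelihood of teacher-written responses while lowering the likelihood of the student's own generations. The sketch below is a schematic rendering of that contrast in PyTorch, not DistiLLM-2's exact loss, which the summary does not specify.

```python
import torch
import torch.nn.functional as F

def sequence_logprob(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Sum of per-token log-probs of `labels` under `logits`; returns (batch,)."""
    logp = F.log_softmax(logits, dim=-1)
    return logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1).sum(dim=-1)

def contrastive_distill_loss(teacher_resp_logits, teacher_resp_labels,
                             student_resp_logits, student_resp_labels):
    # All logits come from the *student* model, scored on two response sets:
    # responses written by the teacher vs. generated by the student itself.
    lp_teacher = sequence_logprob(teacher_resp_logits, teacher_resp_labels)
    lp_student = sequence_logprob(student_resp_logits, student_resp_labels)
    # Push teacher responses up, the student's own responses down.
    return -(lp_teacher - lp_student).mean()
```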
- Distill Not Only Data but Also Rewards: Can Smaller Language Models Surpass Larger Ones? [58.80794196076336]
Distilling large language models (LLMs) typically involves transferring the teacher model's responses through supervised fine-tuning (SFT).
We propose a novel distillation pipeline that transfers both responses and rewards.
Our method generates pseudo-rewards through a self-supervised mechanism that leverages the inherent structure of both teacher and student responses.
arXiv Detail & Related papers (2025-02-26T20:50:11Z)
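The summary does not spell out how pseudo-rewards are derived from "the inherent structure" of responses; one plausible flavor is a likelihood-based score, as in the sketch below. This is a guess at the mechanism's shape, not the paper's formulation; `model` is any Hugging Face-style causal LM.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pseudo_reward(model, input_ids: torch.Tensor,
                  response_mask: torch.Tensor) -> torch.Tensor:
    """Average log-prob of response tokens under `model` (shape: batch,).

    Illustrative self-supervised proxy: responses a model scores as likely
    receive higher reward. Not the paper's actual mechanism.
    """
    logits = model(input_ids).logits[:, :-1]          # predict next tokens
    targets = input_ids[:, 1:]
    logp = F.log_softmax(logits, dim=-1)
    tok_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    mask = response_mask[:, 1:].float()               # score response tokens only
    return (tok_logp * mask).sum(-1) / mask.sum(-1).clamp(min=1)

# A teacher-vs-student margin, e.g. pseudo_reward(teacher, ...) minus
# pseudo_reward(student, ...), could then rank responses for reward transfer.
```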
- FSPO: Few-Shot Preference Optimization of Synthetic Preference Data in LLMs Elicits Effective Personalization to Real Users [111.56469697145519]
We propose Few-Shot Preference Optimization, which reframes reward modeling as a meta-learning problem.
Under this framework, an LLM learns to quickly adapt to a user via a few labeled preferences from that user, constructing a personalized reward function for them.
We generate over 1M synthetic personalized preferences using publicly available LLMs.
We evaluate FSPO on personalized open-ended generation for up to 1,500 synthetic users across three domains: movie reviews, pedagogical adaptation based on educational background, and general question answering, along with a controlled human study.
arXiv Detail & Related papers (2025-02-26T17:08:46Z)
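In the meta-learning framing, a user's few labeled preferences act as an in-context support set the model adapts to. A minimal sketch of constructing such an episode as a prompt, with data shapes and wording as assumptions rather than the paper's templates:

```python
# Sketch: a user's few labeled preferences form the in-context "support set"
# a model adapts to before judging a new query for that same user.

def build_fewshot_preference_prompt(support: list[tuple[str, str, str]],
                                    query: dict) -> str:
    """support: (prompt, chosen, rejected) triples from a single user."""
    lines = ["The same user judged the following response pairs:"]
    for prompt, chosen, rejected in support:
        lines += [f"Prompt: {prompt}",
                  f"Preferred: {chosen}",
                  f"Not preferred: {rejected}",
                  ""]
    lines += ["For this user, which response to the new prompt is preferred?",
              f"Prompt: {query['prompt']}",
              f"A: {query['a']}",
              f"B: {query['b']}",
              "Answer with A or B:"]
    return "\n".join(lines)
```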
- Automated Assignment Grading with Large Language Models: Insights From a Bioinformatics Course [0.0]
Recent advances in natural language processing and large language models (LLMs) offer a promising solution by enabling the efficient delivery of personalized feedback. Our results show that with well-designed prompts, LLMs can achieve grading accuracy and feedback quality comparable to human graders.
arXiv Detail & Related papers (2025-01-24T13:59:14Z)
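A minimal sketch of the "well-designed prompt" approach the summary credits: a rubric-conditioned grading call that returns structured output. The rubric text, model choice, and JSON schema are assumptions, not the paper's setup.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint; hypothetical setup

GRADING_PROMPT = """You are grading a bioinformatics assignment.
Rubric:
{rubric}

Student answer:
{answer}

Return JSON: {{"score": <integer 0-10>, "feedback": "<2-3 actionable sentences>"}}"""

def grade(rubric: str, answer: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user",
                   "content": GRADING_PROMPT.format(rubric=rubric, answer=answer)}],
        response_format={"type": "json_object"},  # ask for parseable JSON
    )
    return json.loads(resp.choices[0].message.content)
```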
- Self-Evolving Critique Abilities in Large Language Models [59.861013614500024]
This paper explores enhancing the critique abilities of Large Language Models (LLMs). We introduce SCRIT, a framework that trains LLMs with self-generated data to evolve their critique abilities. Our analysis reveals that SCRIT's performance scales positively with data and model size.
arXiv Detail & Related papers (2025-01-10T05:51:52Z)
- Automated Feedback in Math Education: A Comparative Analysis of LLMs for Open-Ended Responses [0.0]
This study aims to explore the potential of Large Language Models (LLMs) in facilitating automated feedback in math education.
We employ Mistral, an open-source LLM, and fine-tune it to evaluate student responses by leveraging a dataset of student responses and teacher-written feedback for middle-school math problems.
We evaluate the model's performance in scoring accuracy and the quality of feedback by utilizing judgments from two teachers.
arXiv Detail & Related papers (2024-10-29T16:57:45Z)
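The fine-tuning setup implied here is standard supervised fine-tuning over (student response, teacher feedback) pairs. A sketch of formatting such records, with field names assumed rather than taken from the paper:

```python
# Sketch: turning (problem, student response, teacher feedback) records into
# chat-style SFT examples for a model like Mistral. Field names and the
# output format are illustrative assumptions.

def to_sft_example(record: dict) -> dict:
    prompt = (f"Problem: {record['problem']}\n"
              f"Student response: {record['student_response']}\n"
              "Score the response and give brief, actionable feedback.")
    target = (f"Score: {record['score']}\n"
              f"Feedback: {record['teacher_feedback']}")
    return {"messages": [{"role": "user", "content": prompt},
                         {"role": "assistant", "content": target}]}

# A list of such dicts can be fed to standard SFT tooling; for example,
# TRL's SFTTrainer accepts datasets in this conversational "messages" format.
```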
- Training Language Models to Critique With Multi-agent Feedback [102.42751835338233]
The MultiCritique pipeline improves the critique ability of LLMs by utilizing multi-agent feedback.
The pipeline aggregates high-quality critiques from multiple agents instead of relying on a single model.
Our fine-tuned 7B model significantly surpasses other advanced 7B-13B open-source models.
arXiv Detail & Related papers (2024-10-20T04:57:45Z)
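A minimal sketch of multi-agent critique aggregation in the summary's spirit: several critic models comment independently and a final model merges them. Model names and prompts are placeholders, and the paper's actual aggregation is likely more involved.

```python
from openai import OpenAI

client = OpenAI()  # hypothetical OpenAI-compatible setup, as in earlier sketches

def ask(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

def multi_agent_critique(task: str, response: str,
                         critics=("model-a", "model-b", "model-c"),
                         aggregator="model-d") -> str:
    # Each critic independently critiques the same response.
    critiques = [ask(m, f"Task: {task}\nResponse: {response}\n"
                        "List concrete flaws and suggested fixes.")
                 for m in critics]
    # A final pass merges them, keeping only well-supported points.
    return ask(aggregator,
               "Merge these critiques into one high-quality critique, keeping "
               "only points that are well supported:\n\n"
               + "\n---\n".join(critiques))
```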
- Closing the Loop: Learning to Generate Writing Feedback via Language Model Simulated Student Revisions [6.216542656489173]
We propose PROF, which PROduces Feedback via learning from LM-simulated student revisions.
We empirically test the efficacy of PROF and observe that our approach surpasses a variety of baseline methods at improving students' writing.
arXiv Detail & Related papers (2024-10-10T15:52:48Z)
- "I understand why I got this grade": Automatic Short Answer Grading with Feedback [33.63970664152288]
We introduce Engineering Short Answer Feedback (EngSAF), a dataset designed for automatic short-answer grading with feedback. We incorporate feedback into our dataset by leveraging the generative capabilities of state-of-the-art large language models (LLMs) using our Label-Aware Synthetic Feedback Generation (LASFG) strategy. The best-performing model (Mistral-7B) achieves an overall accuracy of 75.4% and 58.7% on unseen answers and unseen question test sets, respectively.
arXiv Detail & Related papers (2024-06-30T15:42:18Z)
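One plausible reading of "label-aware" generation is that the gold correctness label conditions the feedback prompt. The sketch below follows that reading; the label set and instructions are assumptions, and the paper's exact prompting strategy may differ.

```python
# Sketch of label-aware synthetic feedback generation: the gold correctness
# label steers what the LLM is asked to write.

LABEL_INSTRUCTIONS = {
    "correct": "Confirm the answer is correct and reinforce the key concept.",
    "partially_correct": "State what is right, then explain what is missing.",
    "incorrect": "Explain the misconception and point toward the right idea.",
}

def feedback_prompt(question: str, reference: str, answer: str, label: str) -> str:
    return (f"Question: {question}\n"
            f"Reference answer: {reference}\n"
            f"Student answer: {answer}\n"
            f"Grading label: {label}\n"
            f"{LABEL_INSTRUCTIONS[label]}\n"
            "Write 2-3 sentences of feedback consistent with the label.")
```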
- Beyond Thumbs Up/Down: Untangling Challenges of Fine-Grained Feedback for Text-to-Image Generation [67.88747330066049]
Fine-grained feedback captures nuanced distinctions in image quality and prompt alignment.
We show that its superiority over coarse-grained feedback is not automatic.
We identify key challenges in eliciting and utilizing fine-grained feedback.
arXiv Detail & Related papers (2024-06-24T17:19:34Z)
- UniFL: Improve Latent Diffusion Model via Unified Feedback Learning [61.66652875042845]
We present UniFL, a unified framework that leverages feedback learning to enhance diffusion models comprehensively. UniFL consists of three key components: perceptual feedback learning, which enhances visual quality; decoupled feedback learning, which improves aesthetic appeal; and adversarial feedback learning, which accelerates inference. In-depth experiments and extensive user studies validate the superior performance of our method in enhancing generation quality and inference acceleration.
arXiv Detail & Related papers (2024-04-08T15:14:20Z)
- Evaluating and Optimizing Educational Content with Large Language Model Judgments [52.33701672559594]
We use Language Models (LMs) as educational experts to assess the impact of various instructions on learning outcomes.
We introduce an instruction optimization approach in which one LM generates instructional materials using the judgments of another LM as a reward function.
Human teachers' evaluations of these LM-generated worksheets show a significant alignment between the LM judgments and human teacher preferences.
arXiv Detail & Related papers (2024-03-05T09:09:15Z)
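Using one LM's judgments as a reward for another LM's generations can be as simple as best-of-N selection. A sketch under that simplification (the paper's optimization loop may be more sophisticated), again assuming an OpenAI-compatible client with placeholder model names:

```python
from openai import OpenAI

client = OpenAI()  # hypothetical OpenAI-compatible setup

def ask(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

def optimize_worksheet(topic: str, n_candidates: int = 4) -> str:
    # Generator proposes several candidate materials for the same topic.
    candidates = [ask("generator-model",
                      f"Write a short practice worksheet teaching: {topic}")
                  for _ in range(n_candidates)]

    def judge_score(worksheet: str) -> float:
        # Judge LM acts as the reward function; assumes it replies with a number.
        verdict = ask("judge-model",
                      "On a scale of 1-10, how much would a novice learn from "
                      f"this worksheet? Reply with only the number.\n\n{worksheet}")
        return float(verdict.strip().split()[0])

    return max(candidates, key=judge_score)
```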
- Improving the Validity of Automatically Generated Feedback via Reinforcement Learning [46.667783153759636]
We propose a framework for feedback generation that optimizes both correctness and alignment using reinforcement learning (RL). Specifically, we use GPT-4's annotations to create preferences over feedback pairs in an augmented dataset for training via direct preference optimization (DPO).
arXiv Detail & Related papers (2024-03-02T20:25:50Z)
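The DPO objective mentioned in the summary has a standard closed form. For reference, a minimal PyTorch rendering of vanilla DPO (not this paper's specific training setup):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen: torch.Tensor,
             policy_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor,
             ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss over summed response log-probs (each shape: batch,)."""
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    # Prefer the GPT-4-preferred feedback over the rejected alternative.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```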
- Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning [39.73918872205541]
Many recent methods focus on improving data quality but often overlook the compatibility of the data with the student model being fine-tuned.
This paper introduces Selective Reflection-Tuning, a novel paradigm that synergizes a teacher LLM's reflection and introspection for improving existing data quality.
This teacher-student collaboration produces high-quality and student-compatible instruction-response pairs, resulting in sample-efficient instruction tuning.
arXiv Detail & Related papers (2024-02-15T17:06:21Z)
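A student-side compatibility signal could compare how difficult a response is for the student model with and without its instruction, in the spirit of instruction-following-difficulty scores. The sketch below is one plausible such statistic, not necessarily the paper's; `model` and `tok` are any Hugging Face-style causal LM and tokenizer.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def response_nll(model, tok, prompt: str, response: str) -> float:
    """Average NLL of response tokens, optionally conditioned on a prompt."""
    p_ids = tok(prompt, return_tensors="pt").input_ids if prompt else None
    r_ids = tok(response, return_tensors="pt", add_special_tokens=False).input_ids
    ids = torch.cat([p_ids, r_ids], dim=1) if p_ids is not None else r_ids
    logits = model(ids).logits[:, :-1]
    targets = ids[:, 1:]
    n_prompt = p_ids.shape[1] if p_ids is not None else 0
    start = max(n_prompt - 1, 0)  # score only response tokens
    nll = F.cross_entropy(logits[:, start:].transpose(1, 2),
                          targets[:, start:], reduction="mean")
    return nll.item()

def compatibility_score(model, tok, instruction: str, response: str) -> float:
    # Ratio of conditioned to unconditioned response difficulty: values well
    # below 1 mean the instruction genuinely helps the student model predict
    # the response. (A plausible signal; not necessarily the paper's exact one.)
    return (response_nll(model, tok, instruction, response)
            / max(response_nll(model, tok, "", response), 1e-8))
```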
- Human-AI Collaborative Essay Scoring: A Dual-Process Framework with LLMs [13.262711792955377]
This study explores the effectiveness of Large Language Models (LLMs) for automated essay scoring.
We propose an open-source LLM-based AES system inspired by dual-process theory.
We find that our system not only automates the grading process but also enhances the performance and efficiency of human graders.
arXiv Detail & Related papers (2024-01-12T07:50:10Z)
- DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback [61.28463542324576]
We present DRESS, a large vision-language model (LVLM) that innovatively exploits natural language feedback (NLF) from large language models.
We propose a novel categorization of the NLF into two key types: critique and refinement.
Our experimental results demonstrate that DRESS can generate more helpful (9.76%), honest (11.52%), and harmless (21.03%) responses.
arXiv Detail & Related papers (2023-11-16T18:37:29Z)
- Constructive Large Language Models Alignment with Diverse Feedback [76.9578950893839]
We introduce Constructive and Diverse Feedback (CDF) as a novel method to enhance the alignment of large language models.
We exploit critique feedback for easy problems, refinement feedback for medium problems, and preference feedback for hard problems.
By training our model with this diversified feedback, we achieve enhanced alignment performance while using less training data.
arXiv Detail & Related papers (2023-10-10T09:20:14Z)
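The difficulty-based routing the summary describes reduces to a small decision rule. A sketch with illustrative thresholds and an assumed solve-rate difficulty measure, neither of which comes from the paper:

```python
# Sketch of CDF-style routing: the kind of feedback collected for a training
# problem depends on how hard the current model finds it.

def route_feedback(solve_rate: float) -> str:
    """Map a model's empirical solve rate on a problem to a feedback type."""
    if solve_rate >= 0.8:
        return "critique"      # easy: critique the model's own output
    if solve_rate >= 0.4:
        return "refinement"    # medium: provide a refined/corrected response
    return "preference"        # hard: collect chosen-vs-rejected pairs

# Example: a problem the model solves 55% of the time gets refinement feedback.
assert route_feedback(0.55) == "refinement"
```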
- PREFER: Prompt Ensemble Learning via Feedback-Reflect-Refine [24.888093229577965]
We propose a simple, universal, and automatic method named PREFER to address the stated limitations.
PREFER achieves state-of-the-art performance on multiple types of tasks by a significant margin.
arXiv Detail & Related papers (2023-08-23T09:46:37Z)
- Aligning Large Language Models through Synthetic Feedback [43.84431341195111]
We propose a novel alignment learning framework driven by synthetic feedback that does not depend on extensive human annotations.
In human evaluation, our model is preferred to Alpaca and Dolly-v2, 55.0% and 58.5% of the time, respectively.
arXiv Detail & Related papers (2023-05-23T06:41:16Z)