Related papers: Automated Feedback in Math Education: A Comparative Analysis of LLMs for Open-Ended Responses

URL: http://arxiv.org/abs/2411.08910v1
Date: Tue, 29 Oct 2024 16:57:45 GMT
Title: Automated Feedback in Math Education: A Comparative Analysis of LLMs for Open-Ended Responses
Authors: Sami Baral, Eamon Worden, Wen-Chiang Lim, Zhuang Luo, Christopher Santorelli, Ashish Gurung, Neil Heffernan
Abstract summary: This study aims to explore the potential of Large Language Models (LLMs) in facilitating automated feedback in math education. We employ Mistral, a version of Llama catered to math, and fine-tune this model for evaluating student responses by leveraging a dataset of student responses and teacher-written feedback for middle-school math problems. We evaluate the model's performance in scoring accuracy and the quality of feedback by utilizing judgments from 2 teachers.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The effectiveness of feedback in enhancing learning outcomes is well documented within Educational Data Mining (EDM). Various prior research has explored methodologies to enhance the effectiveness of feedback. Recent developments in Large Language Models (LLMs) have extended their utility in enhancing automated feedback systems. This study aims to explore the potential of LLMs in facilitating automated feedback in math education. We examine the effectiveness of LLMs in evaluating student responses by comparing 3 different models: Llama, SBERT-Canberra, and GPT4 model. The evaluation requires the model to provide both a quantitative score and qualitative feedback on the student's responses to open-ended math problems. We employ Mistral, a version of Llama catered to math, and fine-tune this model for evaluating student responses by leveraging a dataset of student responses and teacher-written feedback for middle-school math problems. A similar approach was taken for training the SBERT model as well, while the GPT4 model used a zero-shot learning approach. We evaluate the model's performance in scoring accuracy and the quality of feedback by utilizing judgments from 2 teachers. The teachers utilized a shared rubric in assessing the accuracy and relevance of the generated feedback. We conduct both quantitative and qualitative analyses of the model performance. By offering a detailed comparison of these methods, this study aims to further the ongoing development of automated feedback systems and outlines potential future directions for leveraging generative LLMs to create more personalized learning experiences.

Related papers

Personalized and Constructive Feedback for Computer Science Students Using the Large Language Model (LLM) [0.8409304328108455]
This paper investigates the performance of Large Language Models (LLMs) in processing students assessments with predefined rubrics and marking criteria. We aim to leverage the power of existing LLMs for Marking Assessments, Tracking, and Evaluation (LLM-MATE) with personalized feedback to enhance students learning.
arXiv Detail & Related papers (2025-10-13T15:59:30Z)
EvaLearn: Quantifying the Learning Capability and Efficiency of LLMs via Sequential Problem Solving [61.99289768925256]
EvaLearn is a benchmark designed to evaluate large language models (LLMs) on their learning capability and efficiency in challenging tasks. We benchmark nine frontier models and observe varied performance profiles. We observe that current LLMs with stronger static abilities do not show a clear advantage in learning capability across all tasks.
arXiv Detail & Related papers (2025-06-03T09:18:33Z)
Assistant-Guided Mitigation of Teacher Preference Bias in LLM-as-a-Judge [82.19239596089089]
We introduce AGDe-Judge, a framework designed to debias from both the labels and feedbacks in the training data. AGDe-Judge effectively reduces teacher preference bias while maintaining strong performance across six evaluation benchmarks.
arXiv Detail & Related papers (2025-05-25T14:48:49Z)
Automatic Evaluation for Text-to-image Generation: Task-decomposed Framework, Distilled Training, and Meta-evaluation Benchmark [62.58869921806019]
We propose a task decomposition evaluation framework based on GPT-4o to automatically construct a new training dataset. We design innovative training strategies to effectively distill GPT-4o's evaluation capabilities into a 7B open-source MLLM, MiniCPM-V-2.6. Experimental results demonstrate that our distilled open-source MLLM significantly outperforms the current state-of-the-art GPT-4o-base baseline.
arXiv Detail & Related papers (2024-11-23T08:06:06Z)
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs [97.94579295913606]
Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia. In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models. This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods.
arXiv Detail & Related papers (2024-11-22T18:59:54Z)
An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation [29.81362106367831]
Existing evaluation methods often suffer from high costs, limited test formats, the need of human references, and systematic evaluation biases. In contrast to previous studies that rely on human annotations, Auto-PRE selects evaluators automatically based on their inherent traits. Experimental results indicate our Auto-PRE achieves state-of-the-art performance at a lower cost.
arXiv Detail & Related papers (2024-10-16T06:06:06Z)
Open Source Language Models Can Provide Feedback: Evaluating LLMs' Ability to Help Students Using GPT-4-As-A-Judge [4.981275578987307]
Large language models (LLMs) have shown great potential for the automatic generation of feedback in a wide range of computing contexts. However, concerns have been voiced around the privacy and ethical implications of sending student work to proprietary models. This has sparked considerable interest in the use of open source LLMs in education, but the quality of the feedback that such open models can produce remains understudied.
arXiv Detail & Related papers (2024-05-08T17:57:39Z)
Investigating Automatic Scoring and Feedback using Large Language Models [46.1232919707345]
This paper explores the efficacy of PEFT-based quantized models, employing classification or regression head, to fine-tune language models for automatic grading and feedback generation. The results show that prediction of grade scores via finetuned LLMs are highly accurate, achieving less than 3% error in grade percentage on average.
arXiv Detail & Related papers (2024-05-01T16:13:54Z)
Evaluating and Optimizing Educational Content with Large Language Model Judgments [52.33701672559594]
We use Language Models (LMs) as educational experts to assess the impact of various instructions on learning outcomes. We introduce an instruction optimization approach in which one LM generates instructional materials using the judgments of another LM as a reward function. Human teachers' evaluations of these LM-generated worksheets show a significant alignment between the LM judgments and human teacher preferences.
arXiv Detail & Related papers (2024-03-05T09:09:15Z)
Improving the Validity of Automatically Generated Feedback via Reinforcement Learning [50.067342343957876]
We propose a framework for feedback generation that optimizes both correctness and alignment using reinforcement learning (RL). Specifically, we use GPT-4's annotations to create preferences over feedback pairs in an augmented dataset for training via direct preference optimization (DPO).
arXiv Detail & Related papers (2024-03-02T20:25:50Z)
Teaching Language Models to Self-Improve through Interactive Demonstrations [83.9421355808174]
Self-improving ability of large language models has been shown to be absent and difficult to learn for smaller models. We introduce TriPosT, a training algorithm that endows smaller models with such self-improvement ability. We show that our approach can improve a LLaMA-7b's performance on math and reasoning tasks by up to 7.13%.
arXiv Detail & Related papers (2023-10-20T14:11:04Z)

This list is automatically generated from the titles and abstracts of the papers on this site.