On the effectiveness of LLMs for automatic grading of open-ended questions in Spanish
- URL: http://arxiv.org/abs/2503.18072v1
- Date: Sun, 23 Mar 2025 13:43:27 GMT
- Title: On the effectiveness of LLMs for automatic grading of open-ended questions in Spanish
- Authors: Germán Capdehourat, Isabel Amigo, Brian Lorenzo, Joaquín Trigo
- Abstract summary: This paper explores the performance of different LLMs and prompting techniques in automatically grading short-text answers to open-ended questions. Results are notably sensitive to prompt styles, suggesting biases toward certain words or content in the prompt.
- Score: 0.8224695424591679
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Grading is a time-consuming and laborious task that educators must face. It is an important task, since it provides feedback signals to learners, and it has been demonstrated that timely feedback improves the learning process. In recent years, the rapid rise of LLMs has shed new light on the effectiveness of automatic grading. In this paper, we explore the performance of different LLMs and prompting techniques in automatically grading short-text answers to open-ended questions. Unlike most of the literature, our study focuses on a use case where the questions, answers, and prompts are all in Spanish. Experimental results comparing automatic scores to those of human-expert evaluators show good outcomes in terms of accuracy, precision, and consistency for advanced LLMs, both open and proprietary. Results are notably sensitive to prompt styles, suggesting biases toward certain words or content in the prompt. Nevertheless, the best combinations of models and prompting strategies consistently surpass 95% accuracy on a three-level grading task, and rise above 98% when the task is simplified to a binary right-or-wrong rating, which demonstrates the potential of LLMs to implement this type of automation in education applications.
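As a rough illustration of the setup described in the abstract, the sketch below grades a short answer in Spanish on a three-level scale with a single LLM call. It is a minimal sketch under stated assumptions, not the authors' prompt or model selection: the rubric wording, the three labels, the gpt-4o model name, and the grade_answer helper are all illustrative.

```python
# Minimal sketch of three-level grading of a short Spanish answer with an LLM.
# The prompt wording, model name, and labels are illustrative assumptions,
# not the prompts or models evaluated in the paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

GRADES = ["correcta", "parcialmente correcta", "incorrecta"]

def grade_answer(question: str, reference: str, student_answer: str) -> str:
    """Return one of the three grade labels for a student's short answer."""
    prompt = (
        "Eres un docente que corrige respuestas cortas.\n"
        f"Pregunta: {question}\n"
        f"Respuesta de referencia: {reference}\n"
        f"Respuesta del estudiante: {student_answer}\n"
        "Clasifica la respuesta del estudiante como 'correcta', "
        "'parcialmente correcta' o 'incorrecta'. Responde solo con la etiqueta."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model; the paper compares several open and proprietary LLMs
        messages=[{"role": "user", "content": prompt}],
        temperature=0,   # deterministic output helps grading consistency
    )
    label = response.choices[0].message.content.strip().lower()
    return label if label in GRADES else "incorrecta"  # fall back on unparseable output
```

Collapsing the intermediate label into one of the other two would give the binary right-or-wrong variant also reported in the abstract.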
Related papers
- Can Large Language Models Match Tutoring System Adaptivity? A Benchmarking Study [0.0]
Large Language Models (LLMs) hold promise as dynamic instructional aids.
Yet, it remains unclear whether LLMs can replicate the adaptivity of intelligent tutoring systems (ITS).
arXiv Detail & Related papers (2025-04-07T23:57:32Z)
- LLMs Can Generate a Better Answer by Aggregating Their Own Responses [83.69632759174405]
Large Language Models (LLMs) have shown remarkable capabilities across tasks, yet they often require additional prompting techniques when facing complex problems. We argue this limitation stems from the fact that common LLM post-training procedures lack explicit supervision for discriminative judgment tasks. We propose Generative Self-Aggregation (GSA), a novel prompting method that improves answer quality without requiring the model's discriminative capabilities.
arXiv Detail & Related papers (2025-03-06T05:25:43Z)
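To make the GSA entry above concrete, the sketch below samples several answers from the same model and then asks it to write a final answer informed by its own drafts, rather than asking it to pick the best one. This is a hedged reading of the general recipe in the summary, not the paper's exact prompts; the prompt wording and model name are assumptions.

```python
# Sketch of Generative Self-Aggregation: sample several responses, then
# generate a final answer conditioned on them (no discriminative "pick one" step).
# Prompts and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # assumed; any chat model works for the sketch

def ask(prompt: str, temperature: float = 0.0) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp.choices[0].message.content

def generative_self_aggregation(question: str, n_samples: int = 3) -> str:
    # 1) Diverse drafts from the same model.
    drafts = [ask(question, temperature=0.8) for _ in range(n_samples)]
    # 2) Generate (not select) an improved answer using the drafts as context.
    context = "\n\n".join(f"Draft {i + 1}:\n{d}" for i, d in enumerate(drafts))
    synthesis_prompt = (
        f"Question: {question}\n\n{context}\n\n"
        "Using the drafts above as reference material, write a single, "
        "improved answer to the question."
    )
    return ask(synthesis_prompt)
```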
- Use Me Wisely: AI-Driven Assessment for LLM Prompting Skills Development [5.559706293891474]
Large language model (LLM)-powered chatbots have become popular across various domains, supporting a range of tasks and processes.
Yet, prompting is highly task- and domain-dependent, limiting the effectiveness of generic approaches.
In this study, we explore whether LLM-based methods can facilitate learning assessments by using ad-hoc guidelines and a minimal number of annotated prompt samples.
arXiv Detail & Related papers (2025-03-04T11:56:33Z)
- Automated Assignment Grading with Large Language Models: Insights From a Bioinformatics Course [0.0]
Recent advances in natural language processing and large language models (LLMs) offer a promising solution by enabling the efficient delivery of personalized feedback. Our results show that with well-designed prompts, LLMs can achieve grading accuracy and feedback quality comparable to human graders.
arXiv Detail & Related papers (2025-01-24T13:59:14Z)
- Exploring Knowledge Tracing in Tutor-Student Dialogues using LLMs [49.18567856499736]
We investigate whether large language models (LLMs) can be supportive of open-ended dialogue tutoring. We apply a range of knowledge tracing (KT) methods on the resulting labeled data to track student knowledge levels over an entire dialogue. We conduct experiments on two tutoring dialogue datasets, and show that a novel yet simple LLM-based method, LLMKT, significantly outperforms existing KT methods in predicting student response correctness in dialogues.
arXiv Detail & Related papers (2024-09-24T22:31:39Z)
- Unveiling Scoring Processes: Dissecting the Differences between LLMs and Human Graders in Automatic Scoring [21.7782670140939]
Large language models (LLMs) have demonstrated strong potential in performing automatic scoring for constructed response assessments. While constructed responses graded by humans are usually based on given grading rubrics, the methods by which LLMs assign scores remain largely unclear. This paper uncovers the grading rubrics that LLMs used to score students' written responses to science tasks and their alignment with human scores.
arXiv Detail & Related papers (2024-07-04T22:26:20Z)
- Fact-and-Reflection (FaR) Improves Confidence Calibration of Large Language Models [84.94220787791389]
We propose Fact-and-Reflection (FaR) prompting, which improves LLM calibration in two steps.
Experiments show that FaR achieves significantly better calibration; it lowers the Expected Calibration Error by 23.5%.
FaR even elicits the capability of verbally expressing concerns in less confident scenarios.
arXiv Detail & Related papers (2024-02-27T01:37:23Z)
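The two steps mentioned in the FaR summary above are not spelled out there; the sketch below shows one plausible fact-then-reflect prompt chain (first elicit relevant facts, then answer by reflecting on them and flagging low confidence). The prompt wording, model name, and this particular two-step decomposition are assumptions for illustration, not the paper's implementation.

```python
# Sketch of a fact-then-reflect prompting chain in the spirit of FaR.
# The exact prompts and step decomposition used in the paper may differ;
# everything below is an illustrative assumption.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # assumed model

def chat(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

def fact_and_reflect(question: str) -> str:
    # Step 1: elicit facts the model considers relevant to the question.
    facts = chat(f"List the facts you know that are relevant to answering:\n{question}")
    # Step 2: reflect on those facts, answer, and verbalize uncertainty when warranted.
    return chat(
        f"Question: {question}\n\nRelevant facts:\n{facts}\n\n"
        "Reflect on these facts, then give your answer. If the facts do not "
        "clearly support an answer, say that you are not confident."
    )
```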
- Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves [57.974103113675795]
We present a method named 'Rephrase and Respond' (RaR), which allows Large Language Models to rephrase and expand questions posed by humans.
RaR serves as a simple yet effective prompting method for improving performance.
We show that RaR is complementary to the popular Chain-of-Thought (CoT) methods, both theoretically and empirically.
arXiv Detail & Related papers (2023-11-07T18:43:34Z)
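A minimal sketch of the RaR idea above, under the assumption that it can be run as a single prompt asking the model to restate and expand the question before answering; the exact instruction wording and the model name below are assumptions, not necessarily the paper's.

```python
# Sketch of one-step Rephrase-and-Respond prompting: ask the model to restate
# and expand the question before answering it. Prompt wording and model name
# are assumptions for illustration.
from openai import OpenAI

client = OpenAI()

def rephrase_and_respond(question: str) -> str:
    prompt = (
        f'"{question}"\n'
        "Rephrase and expand the question to remove ambiguity, and then respond."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content
```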
- Towards LLM-based Autograding for Short Textual Answers [4.853810201626855]
This manuscript evaluates a large language model for the purpose of autograding short textual answers.
Our findings suggest that while "out-of-the-box" LLMs provide a valuable tool, their readiness for independent automated grading remains a work in progress.
arXiv Detail & Related papers (2023-09-09T22:25:56Z)
- Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies [104.32199881187607]
Large language models (LLMs) have demonstrated remarkable performance across a wide array of NLP tasks.
A promising approach to rectifying flaws in LLM outputs is self-correction, where the LLM itself is prompted or guided to fix problems in its own output.
This paper presents a comprehensive review of this emerging class of techniques.
arXiv Detail & Related papers (2023-08-06T18:38:52Z)
- Automatic Prompt Optimization with "Gradient Descent" and Beam Search [64.08364384823645]
Large Language Models (LLMs) have shown impressive performance as general-purpose agents, but their abilities remain highly dependent on prompts.
We propose a simple and nonparametric solution to this problem, Automatic Prompt Optimization (APO).
APO uses minibatches of data to form natural language "gradients" that criticize the current prompt.
The gradients are then "propagated" into the prompt by editing the prompt in the opposite semantic direction of the gradient.
arXiv Detail & Related papers (2023-05-04T15:15:22Z)
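The APO entry above describes the loop at a high level: a minibatch of examples where the current prompt fails is turned into a natural-language critique (the "gradient"), and the prompt is then edited to address that critique. The sketch below is a hedged rendering of that loop with a single candidate per step and no beam search; the prompts, model name, and helper names are assumptions rather than the paper's implementation.

```python
# Hedged sketch of an APO-style loop: critique the prompt on a minibatch of
# failures ("textual gradient"), then edit the prompt to address the critique.
# The real method keeps a beam of candidate prompts; this sketch keeps one.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # assumed model

def chat(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}], temperature=0
    )
    return resp.choices[0].message.content

def apo_step(task_prompt: str, failures: list[tuple[str, str, str]]) -> str:
    """One optimization step. `failures` holds (input, model_output, expected) triples."""
    examples = "\n".join(
        f"Input: {x}\nModel output: {y}\nExpected: {t}" for x, y, t in failures
    )
    # 1) Natural-language "gradient": why does the current prompt fail on these cases?
    gradient = chat(
        f"Prompt:\n{task_prompt}\n\nFailing examples:\n{examples}\n\n"
        "Explain in a few sentences what is wrong with the prompt."
    )
    # 2) "Propagate" the gradient: rewrite the prompt to fix the described problems.
    return chat(
        f"Prompt:\n{task_prompt}\n\nCritique:\n{gradient}\n\n"
        "Rewrite the prompt so that it addresses the critique. Return only the new prompt."
    )
```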
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.