Related papers: Evaluating Students' Open-ended Written Responses with LLMs: Using the RAG Framework for GPT-3.5, GPT-4, Claude-3, and Mistral-Large

Evaluating Students' Open-ended Written Responses with LLMs: Using the RAG Framework for GPT-3.5, GPT-4, Claude-3, and Mistral-Large

URL: http://arxiv.org/abs/2405.05444v1
Date: Wed, 8 May 2024 22:23:58 GMT
Title: Evaluating Students' Open-ended Written Responses with LLMs: Using the RAG Framework for GPT-3.5, GPT-4, Claude-3, and Mistral-Large
Authors: Jussi S. Jauhiainen, Agustín Garagorry Guerra,
Abstract summary: evaluating open-ended written responses from students is an essential yet time-intensive task for educators. Recent developments in Large Language Models (LLMs) present a promising opportunity to balance the need for thorough evaluation with efficient use of educators' time.
Score: 0.0
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Evaluating open-ended written examination responses from students is an essential yet time-intensive task for educators, requiring a high degree of effort, consistency, and precision. Recent developments in Large Language Models (LLMs) present a promising opportunity to balance the need for thorough evaluation with efficient use of educators' time. In our study, we explore the effectiveness of LLMs ChatGPT-3.5, ChatGPT-4, Claude-3, and Mistral-Large in assessing university students' open-ended answers to questions made about reference material they have studied. Each model was instructed to evaluate 54 answers repeatedly under two conditions: 10 times (10-shot) with a temperature setting of 0.0 and 10 times with a temperature of 0.5, expecting a total of 1,080 evaluations per model and 4,320 evaluations across all models. The RAG (Retrieval Augmented Generation) framework was used as the framework to make the LLMs to process the evaluation of the answers. As of spring 2024, our analysis revealed notable variations in consistency and the grading outcomes provided by studied LLMs. There is a need to comprehend strengths and weaknesses of LLMs in educational settings for evaluating open-ended written responses. Further comparative research is essential to determine the accuracy and cost-effectiveness of using LLMs for educational assessments.

Related papers

LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models [51.55869466207234]
Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting.<n>We introduce LLMEval-3, a framework for dynamic evaluation of LLMs.<n>LLEval-3 is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run.
arXiv Detail & Related papers (2025-08-07T14:46:30Z)
Measuring Determinism in Large Language Models for Software Code Review [7.46879002825422]
Large Language Models (LLMs) promise to streamline software code reviews, but their ability to produce consistent assessments remains an open question. We tested four leading LLMs on 70 Java commits from both private and public repositories.
arXiv Detail & Related papers (2025-02-28T05:53:16Z)
Benchmarking Generative AI for Scoring Medical Student Interviews in Objective Structured Clinical Examinations (OSCEs) [0.5434005537854512]
This study explored the potential of large language models (LLMs) to automate OSCE evaluations using the Master Interview Rating Scale (MIRS) We compared the performance of four state-of-the-art LLMs in evaluating OSCE transcripts across all 28 items of the MIRS under the conditions of zero-shot, chain-of-thought (CoT), few-shot, and multi-step prompting.
arXiv Detail & Related papers (2025-01-21T04:05:45Z)
A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look [52.114284476700874]
This paper reports on the results of a large-scale evaluation (the TREC 2024 RAG Track) where four different relevance assessment approaches were deployed. We find that automatically generated UMBRELA judgments can replace fully manual judgments to accurately capture run-level effectiveness. Surprisingly, we find that LLM assistance does not appear to increase correlation with fully manual assessments, suggesting that costs associated with human-in-the-loop processes do not bring obvious tangible benefits.
arXiv Detail & Related papers (2024-11-13T01:12:35Z)
Automated Feedback in Math Education: A Comparative Analysis of LLMs for Open-Ended Responses [0.0]
This study aims to explore the potential of Large Language Models (LLMs) in facilitating automated feedback in math education. We employ Mistral, a version of Llama catered to math, and fine-tune this model for evaluating student responses by leveraging a dataset of student responses and teacher-written feedback for middle-school math problems. We evaluate the model's performance in scoring accuracy and the quality of feedback by utilizing judgments from 2 teachers.
arXiv Detail & Related papers (2024-10-29T16:57:45Z)
Large Language Model as an Assignment Evaluator: Insights, Feedback, and Challenges in a 1000+ Student Course [49.296957552006226]
Using large language models (LLMs) for automatic evaluation has become an important evaluation method in NLP research. This report shares how we use GPT-4 as an automatic assignment evaluator in a university course with 1,028 students.
arXiv Detail & Related papers (2024-07-07T00:17:24Z)
Finding Blind Spots in Evaluator LLMs with Interpretable Checklists [23.381287828102995]
We investigate the effectiveness of Large Language Models (LLMs) as evaluators for text generation tasks. We propose FBI, a novel framework designed to examine the proficiency of Evaluator LLMs in assessing four critical abilities.
arXiv Detail & Related papers (2024-06-19T10:59:48Z)
Decompose and Aggregate: A Step-by-Step Interpretable Evaluation Framework [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators. The question of how reliable these evaluators are has emerged as a crucial research question. We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
State of What Art? A Call for Multi-Prompt LLM Evaluation [28.307860675006545]
We comprehensively analyze the brittleness of results obtained via single-prompt evaluations across 6.5M instances. To improve robustness of the analysis, we propose to evaluate LLMs with a set of diverse prompts instead.
arXiv Detail & Related papers (2023-12-31T22:21:36Z)
LLMEval: A Preliminary Study on How to Evaluate Large Language Models [47.12588320134504]
We analyze evaluation methods by comparing various criteria with both manual and automatic evaluation, utilizing onsite, crowd-sourcing, public annotators and GPT-4. A total of 2,186 individuals participated, leading to the generation of 243,337 manual annotations and 57,511 automatic evaluation results.
arXiv Detail & Related papers (2023-12-12T16:14:43Z)
Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization [132.25202059478065]
We benchmark large language models (LLMs) on instruction controllable text summarization. Our study reveals that instruction controllable text summarization remains a challenging task for LLMs.
arXiv Detail & Related papers (2023-11-15T18:25:26Z)
L-Eval: Instituting Standardized Evaluation for Long Context Language Models [91.05820785008527]
We propose L-Eval to institute a more standardized evaluation for long context language models (LCLMs) We build a new evaluation suite containing 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs. Results show that popular n-gram matching metrics generally can not correlate well with human judgment.
arXiv Detail & Related papers (2023-07-20T17:59:41Z)
A Survey on Evaluation of Large Language Models [87.60417393701331]
Large language models (LLMs) are gaining increasing popularity in both academia and industry. This paper focuses on three key dimensions: what to evaluate, where to evaluate, and how to evaluate.
arXiv Detail & Related papers (2023-07-06T16:28:35Z)
Boosting Theory-of-Mind Performance in Large Language Models via Prompting [2.538209532048867]
This study measures the ToM performance of GPT-4 and three GPT-3.5 variants. We investigated the effectiveness of in-context learning in improving ToM comprehension.
arXiv Detail & Related papers (2023-04-22T22:50:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.