Using Large Language Models for Automated Grading of Student Writing about Science
- URL: http://arxiv.org/abs/2412.18719v1
- Date: Wed, 25 Dec 2024 00:31:53 GMT
- Title: Using Large Language Models for Automated Grading of Student Writing about Science
- Authors: Chris Impey, Matthew Wenger, Nikhil Garuda, Shahriar Golchin, Sarah Stamer,
- Abstract summary: AI has introduced the possibility of using large language models (LLMs) to evaluate student writing.
An experiment was conducted using GPT-4 to determine if machine learning methods based on LLMs can match or exceed the reliability of instructor grading.
The results should also be applicable to non-science majors in university settings, where the content and modes of evaluation are similar.
- Score: 2.883578416080909
- License:
- Abstract: Assessing writing in large classes for formal or informal learners presents a significant challenge. Consequently, most large classes, particularly in science, rely on objective assessment tools such as multiple-choice quizzes, which have a single correct answer. The rapid development of AI has introduced the possibility of using large language models (LLMs) to evaluate student writing. An experiment was conducted using GPT-4 to determine if machine learning methods based on LLMs can match or exceed the reliability of instructor grading in evaluating short writing assignments on topics in astronomy. The audience consisted of adult learners in three massive open online courses (MOOCs) offered through Coursera. One course was on astronomy, the second was on astrobiology, and the third was on the history and philosophy of astronomy. The results should also be applicable to non-science majors in university settings, where the content and modes of evaluation are similar. The data comprised answers from 120 students to 12 questions across the three courses. GPT-4 was provided with total grades, model answers, and rubrics from an instructor for all three courses. In addition to evaluating how reliably the LLM reproduced instructor grades, the LLM was also tasked with generating its own rubrics. Overall, the LLM was more reliable than peer grading, both in aggregate and by individual student, and approximately matched instructor grades for all three online courses. The implication is that LLMs may soon be used for automated, reliable, and scalable grading of student science writing.
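A minimal sketch of the rubric-guided grading setup described in the abstract, assuming the OpenAI chat completions API; the question, rubric, model answer, point scale, and prompt wording below are hypothetical placeholders, not the paper's actual course materials or prompts.

```python
# Minimal sketch of rubric-guided LLM grading; the prompt wording, rubric, and
# course materials are hypothetical, not the paper's actual prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def grade_short_answer(question: str, rubric: str, model_answer: str,
                       student_answer: str, max_points: int = 5) -> str:
    """Return the model's grade and one-sentence justification for a single answer."""
    prompt = (
        f"You are grading a short student writing assignment in astronomy.\n\n"
        f"Question:\n{question}\n\n"
        f"Instructor rubric:\n{rubric}\n\n"
        f"Model answer:\n{model_answer}\n\n"
        f"Student answer:\n{student_answer}\n\n"
        f"Assign a grade from 0 to {max_points} by applying the rubric, then give a "
        f"one-sentence justification. Start your reply with 'Grade: <number>'."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # keep grading as repeatable as possible
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example call with placeholder materials:
# print(grade_short_answer(
#     question="Explain why the Moon shows phases.",
#     rubric="2 pts: half the Moon is always sunlit; 2 pts: phases come from the "
#            "changing Sun-Earth-Moon geometry; 1 pt: clear writing.",
#     model_answer="The Sun always lights half the Moon; as the Moon orbits Earth we "
#                  "see different fractions of that lit half, which we call phases.",
#     student_answer="Phases happen because Earth's shadow falls on the Moon.",
# ))
```

Pinning the temperature to 0 and requesting a fixed "Grade: <number>" first line makes the scores easy to parse and compare against instructor and peer grades, which is the kind of reliability comparison the abstract reports.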
Related papers
- Humanity's Last Exam [253.45228996132735]
Humanity's Last Exam (HLE) is a multi-modal benchmark at the frontier of human knowledge.
It consists of 3,000 questions across dozens of subjects, including mathematics, humanities, and the natural sciences.
Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval.
arXiv Detail & Related papers (2025-01-24T05:27:46Z) - CLR-Bench: Evaluating Large Language Models in College-level Reasoning [17.081788240112417]
Large language models (LLMs) have demonstrated their remarkable performance across various language understanding tasks.
We present CLR-Bench to comprehensively evaluate the LLMs in complex college-level reasoning.
arXiv Detail & Related papers (2024-10-23T04:55:08Z) - Large Language Model as an Assignment Evaluator: Insights, Feedback, and Challenges in a 1000+ Student Course [49.296957552006226]
Using large language models (LLMs) for automatic evaluation has become an important evaluation method in NLP research.
This report shares how we use GPT-4 as an automatic assignment evaluator in a university course with 1,028 students.
arXiv Detail & Related papers (2024-07-07T00:17:24Z) - LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing [106.45895712717612]
Large language models (LLMs) have shown remarkable versatility in various generative tasks.
This study focuses on how LLMs can assist NLP researchers.
To our knowledge, this is the first work to provide such a comprehensive analysis.
arXiv Detail & Related papers (2024-06-24T01:30:22Z) - Grading Massive Open Online Courses Using Large Language Models [3.0936354370614607]
Massive open online courses (MOOCs) offer free education globally.
Peer grading, often guided by a straightforward rubric, is the method of choice.
We explore the feasibility of using large language models (LLMs) to replace peer grading in MOOCs.
arXiv Detail & Related papers (2024-06-16T23:42:11Z) - SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading [100.02175403852253]
One common use of Large Language Models (LLMs) is performing tasks on scientific topics.
Inspired by the way university students are evaluated on such tasks, we propose SciEx - a benchmark consisting of university computer science exam questions.
We evaluate the performance of various state-of-the-art LLMs on our new benchmark.
arXiv Detail & Related papers (2024-06-14T21:52:21Z) - CourseAssist: Pedagogically Appropriate AI Tutor for Computer Science Education [1.052788652996288]
This poster introduces CourseAssist, a novel LLM-based tutoring system tailored for computer science education.
Unlike generic LLM systems, CourseAssist uses retrieval-augmented generation, user intent classification, and question decomposition to align AI responses with specific course materials and learning objectives.
arXiv Detail & Related papers (2024-05-01T20:43:06Z) - CS1-LLM: Integrating LLMs into CS1 Instruction [0.6282171844772422]
This experience report describes a CS1 course at a large research-intensive university that fully embraces the use of Large Language Models.
To incorporate the LLMs, the course was intentionally altered to reduce emphasis on syntax and writing code from scratch.
Students were given three large, open-ended projects in three separate domains that allowed them to showcase their creativity.
arXiv Detail & Related papers (2024-04-17T14:44:28Z) - Large Language Models As MOOCs Graders [3.379574469735166]
We explore the feasibility of leveraging large language models (LLMs) to replace peer grading in MOOCs.
To instruct LLMs, we use three different prompts based on a variant of the zero-shot chain-of-thought prompting technique.
Our results show that Zero-shot-CoT, when integrated with instructor-provided answers and rubrics, produces grades that are more aligned with those assigned by instructors; a prompt sketch in this style follows the related-papers list below.
arXiv Detail & Related papers (2024-02-06T07:43:07Z) - SciInstruct: a Self-Reflective Instruction Annotated Dataset for Training Scientific Language Models [57.96527452844273]
We introduce SciInstruct, a suite of scientific instructions for training scientific language models capable of college-level scientific reasoning.
We curated a diverse and high-quality dataset encompassing physics, chemistry, math, and formal proofs.
To verify the effectiveness of SciInstruct, we fine-tuned different language models with SciInstruct, i.e., ChatGLM3 (6B and 32B), Llama3-8B-Instruct, and Mistral-7B: MetaMath.
arXiv Detail & Related papers (2024-01-15T20:22:21Z) - ElitePLM: An Empirical Study on General Language Ability Evaluation of Pretrained Language Models [78.08792285698853]
We present a large-scale empirical study on general language ability evaluation of pretrained language models (ElitePLM).
Our empirical results demonstrate that: (1) PLMs with varying training objectives and strategies are good at different ability tests; (2) fine-tuning PLMs in downstream tasks is usually sensitive to the data size and distribution; and (3) PLMs have excellent transferability between similar tasks.
arXiv Detail & Related papers (2022-05-03T14:18:10Z)
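Relatedly, the "Large Language Models As MOOCs Graders" entry above reports that zero-shot chain-of-thought prompting works best when combined with instructor-provided answers and rubrics. The sketch below shows what such a prompt might look like; the template wording is an assumption for illustration, not that paper's published prompt.

```python
# Hypothetical Zero-shot-CoT grading prompt, illustrating the combination of
# rubric, instructor answer, and the chain-of-thought trigger phrase.
def build_zero_shot_cot_grading_prompt(question: str, rubric: str,
                                        instructor_answer: str,
                                        student_answer: str) -> str:
    """Assemble a single grading prompt; the exact wording is illustrative only."""
    return (
        f"Question:\n{question}\n\n"
        f"Grading rubric:\n{rubric}\n\n"
        f"Instructor's answer:\n{instructor_answer}\n\n"
        f"Student's answer:\n{student_answer}\n\n"
        # Zero-shot chain-of-thought trigger, followed by the grading request:
        "Let's think step by step about how well the student's answer meets each "
        "rubric criterion, then state a final numeric grade on the last line."
    )
```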
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.