Using Large Language Models for Automated Grading of Student Writing about Science
- URL: http://arxiv.org/abs/2412.18719v1
- Date: Wed, 25 Dec 2024 00:31:53 GMT
- Title: Using Large Language Models for Automated Grading of Student Writing about Science
- Authors: Chris Impey, Matthew Wenger, Nikhil Garuda, Shahriar Golchin, Sarah Stamer,
- Abstract summary: AI has introduced the possibility of using large language models (LLMs) to evaluate student writing.
An experiment was conducted using GPT-4 to determine if machine learning methods based on LLMs can match or exceed the reliability of instructor grading.
The results should also be applicable to non-science majors in university settings, where the content and modes of evaluation are similar.
- Score: 2.883578416080909
- License:
- Abstract: Assessing writing in large classes for formal or informal learners presents a significant challenge. Consequently, most large classes, particularly in science, rely on objective assessment tools such as multiple-choice quizzes, which have a single correct answer. The rapid development of AI has introduced the possibility of using large language models (LLMs) to evaluate student writing. An experiment was conducted using GPT-4 to determine if machine learning methods based on LLMs can match or exceed the reliability of instructor grading in evaluating short writing assignments on topics in astronomy. The audience consisted of adult learners in three massive open online courses (MOOCs) offered through Coursera. One course was on astronomy, the second was on astrobiology, and the third was on the history and philosophy of astronomy. The results should also be applicable to non-science majors in university settings, where the content and modes of evaluation are similar. The data comprised answers from 120 students to 12 questions across the three courses. GPT-4 was provided with total grades, model answers, and rubrics from an instructor for all three courses. In addition to evaluating how reliably the LLM reproduced instructor grades, the LLM was also tasked with generating its own rubrics. Overall, the LLM was more reliable than peer grading, both in aggregate and by individual student, and approximately matched instructor grades for all three online courses. The implication is that LLMs may soon be used for automated, reliable, and scalable grading of student science writing.
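A minimal sketch of the rubric-guided grading setup described in the abstract, assuming the OpenAI chat completions API; the question, rubric, model answer, point scale, and prompt wording below are hypothetical placeholders, not the paper's actual course materials or prompts.

```python
# Minimal sketch of rubric-guided LLM grading; the prompt wording, rubric, and
# course materials are hypothetical, not the paper's actual prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def grade_short_answer(question: str, rubric: str, model_answer: str,
                       student_answer: str, max_points: int = 5) -> str:
    """Return the model's grade and one-sentence justification for a single answer."""
    prompt = (
        f"You are grading a short student writing assignment in astronomy.\n\n"
        f"Question:\n{question}\n\n"
        f"Instructor rubric:\n{rubric}\n\n"
        f"Model answer:\n{model_answer}\n\n"
        f"Student answer:\n{student_answer}\n\n"
        f"Assign a grade from 0 to {max_points} by applying the rubric, then give a "
        f"one-sentence justification. Start your reply with 'Grade: <number>'."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # keep grading as repeatable as possible
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example call with placeholder materials:
# print(grade_short_answer(
#     question="Explain why the Moon shows phases.",
#     rubric="2 pts: half the Moon is always sunlit; 2 pts: phases come from the "
#            "changing Sun-Earth-Moon geometry; 1 pt: clear writing.",
#     model_answer="The Sun always lights half the Moon; as the Moon orbits Earth we "
#                  "see different fractions of that lit half, which we call phases.",
#     student_answer="Phases happen because Earth's shadow falls on the Moon.",
# ))
```

Pinning the temperature to 0 and requesting a fixed "Grade: <number>" first line makes the scores easy to parse and compare against instructor and peer grades, which is the kind of reliability comparison the abstract reports.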
Related papers
- Humanity's Last Exam [253.45228996132735]
Humanity's Last Exam (HLE) is a multi-modal benchmark at the frontier of human knowledge.
It consists of 3,000 questions across dozens of subjects, including mathematics, humanities, and the natural sciences.
Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval.
arXiv Detail & Related papers (2025-01-24T05:27:46Z) - CLR-Bench: Evaluating Large Language Models in College-level Reasoning [17.081788240112417]
Large language models (LLMs) have demonstrated their remarkable performance across various language understanding tasks.
We present CLR-Bench to comprehensively evaluate the LLMs in complex college-level reasoning.
arXiv Detail & Related papers (2024-10-23T04:55:08Z) - Large Language Model as an Assignment Evaluator: Insights, Feedback, and Challenges in a 1000+ Student Course [49.296957552006226]
Using large language models (LLMs) for automatic evaluation has become an important evaluation method in NLP research.
This report shares how we use GPT-4 as an automatic assignment evaluator in a university course with 1,028 students.
arXiv Detail & Related papers (2024-07-07T00:17:24Z) - LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing [106.45895712717612]
Large language models (LLMs) have shown remarkable versatility in various generative tasks.
This study focuses on how LLMs can assist NLP researchers.
To our knowledge, this is the first work to provide such a comprehensive analysis.
arXiv Detail & Related papers (2024-06-24T01:30:22Z) - Grading Massive Open Online Courses Using Large Language Models [3.0936354370614607]
Massive open online courses (MOOCs) offer free education globally.
Peer grading, often guided by a straightforward rubric, is the method of choice.
We explore the feasibility of using large language models (LLMs) to replace peer grading in MOOCs.
arXiv Detail & Related papers (2024-06-16T23:42:11Z) - SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading [100.02175403852253]
One common use of Large Language Models (LLMs) is performing tasks on scientific topics.
Inspired by the way university students are evaluated on such tasks, we propose SciEx - a benchmark consisting of university computer science exam questions.
We evaluate the performance of various state-of-the-art LLMs on our new benchmark.
arXiv Detail & Related papers (2024-06-14T21:52:21Z) - CourseAssist: Pedagogically Appropriate AI Tutor for Computer Science Education [1.052788652996288]
This poster introduces CourseAssist, a novel LLM-based tutoring system tailored for computer science education.
Unlike generic LLM systems, CourseAssist uses retrieval-augmented generation, user intent classification, and question decomposition to align AI responses with specific course materials and learning objectives.
arXiv Detail & Related papers (2024-05-01T20:43:06Z) - CS1-LLM: Integrating LLMs into CS1 Instruction [0.6282171844772422]
This experience report describes a CS1 course at a large research-intensive university that fully embraces the use of Large Language Models.
To incorporate the LLMs, the course was intentionally altered to reduce emphasis on syntax and writing code from scratch.
Students were given three large, open-ended projects in three separate domains that allowed them to showcase their creativity.
arXiv Detail & Related papers (2024-04-17T14:44:28Z) - Large Language Models As MOOCs Graders [3.379574469735166]
We explore the feasibility of leveraging large language models (LLMs) to replace peer grading in MOOCs.
To instruct LLMs, we use three different prompts based on a variant of the zero-shot chain-of-thought prompting technique.
Our results show that Zero-shot-CoT, when integrated with instructor-provided answers and rubrics, produces grades that are more aligned with those assigned by instructors; a prompt sketch in this style follows the related-papers list below.
arXiv Detail & Related papers (2024-02-06T07:43:07Z) - SciInstruct: a Self-Reflective Instruction Annotated Dataset for Training Scientific Language Models [57.96527452844273]
We introduce SciInstruct, a suite of scientific instructions for training scientific language models capable of college-level scientific reasoning.
We curated a diverse and high-quality dataset encompassing physics, chemistry, math, and formal proofs.
To verify the effectiveness of SciInstruct, we fine-tuned different language models with SciInstruct, i.e., ChatGLM3 (6B and 32B), Llama3-8B-Instruct, and Mistral-7B: MetaMath.
arXiv Detail & Related papers (2024-01-15T20:22:21Z) - ElitePLM: An Empirical Study on General Language Ability Evaluation of Pretrained Language Models [78.08792285698853]
We present a large-scale empirical study on general language ability evaluation of pretrained language models (ElitePLM).
Our empirical results demonstrate that: (1) PLMs with varying training objectives and strategies are good at different ability tests; (2) fine-tuning PLMs in downstream tasks is usually sensitive to the data size and distribution; and (3) PLMs have excellent transferability between similar tasks.
arXiv Detail & Related papers (2022-05-03T14:18:10Z)
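Relatedly, the "Large Language Models As MOOCs Graders" entry above reports that zero-shot chain-of-thought prompting works best when combined with instructor-provided answers and rubrics. The sketch below shows what such a prompt might look like; the template wording is an assumption for illustration, not that paper's published prompt.

```python
# Hypothetical Zero-shot-CoT grading prompt, illustrating the combination of
# rubric, instructor answer, and the chain-of-thought trigger phrase.
def build_zero_shot_cot_grading_prompt(question: str, rubric: str,
                                        instructor_answer: str,
                                        student_answer: str) -> str:
    """Assemble a single grading prompt; the exact wording is illustrative only."""
    return (
        f"Question:\n{question}\n\n"
        f"Grading rubric:\n{rubric}\n\n"
        f"Instructor's answer:\n{instructor_answer}\n\n"
        f"Student's answer:\n{student_answer}\n\n"
        # Zero-shot chain-of-thought trigger, followed by the grading request:
        "Let's think step by step about how well the student's answer meets each "
        "rubric criterion, then state a final numeric grade on the last line."
    )
```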
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.