Applying Large Language Models and Chain-of-Thought for Automatic
Scoring
- URL: http://arxiv.org/abs/2312.03748v2
- Date: Fri, 16 Feb 2024 19:47:48 GMT
- Title: Applying Large Language Models and Chain-of-Thought for Automatic
Scoring
- Authors: Gyeong-Geon Lee, Ehsan Latif, Xuansheng Wu, Ninghao Liu, and Xiaoming
Zhai
- Abstract summary: This study investigates the application of large language models (LLMs) in the automatic scoring of student-written responses to science assessments.
We focused on overcoming the challenges of accessibility, technical complexity, and lack of explainability that have previously limited the use of artificial intelligence-based automatic scoring tools.
- Score: 23.076596289069506
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This study investigates the application of large language models (LLMs),
specifically GPT-3.5 and GPT-4, with Chain-of-Thought (CoT) in the automatic
scoring of student-written responses to science assessments. We focused on
overcoming the challenges of accessibility, technical complexity, and lack of
explainability that have previously limited the use of artificial
intelligence-based automatic scoring tools among researchers and educators.
With a testing dataset comprising six assessment tasks (three binomial and
three trinomial) with 1,650 student responses, we employed six prompt
engineering strategies to automatically score student responses. The six
strategies combined zero-shot or few-shot learning with CoT, either alone or
alongside item stem and scoring rubrics. Results indicated that few-shot (acc =
.67) outperformed zero-shot learning (acc = .60), a 12.6% increase. CoT,
when used without item stem and scoring rubrics, did not significantly affect
scoring accuracy (acc = .60). However, CoT prompting paired with contextual
item stems and rubrics proved to be a significant contributor to scoring
accuracy (13.44% increase for zero-shot; 3.7% increase for few-shot). We found
a more balanced accuracy across different proficiency categories when CoT was
used with a scoring rubric, highlighting the importance of domain-specific
reasoning in enhancing the effectiveness of LLMs in scoring tasks. We also
found that GPT-4 demonstrated superior performance over GPT-3.5 in various
scoring tasks when combined with either the single-call greedy sampling or the
ensemble-voting nucleus sampling strategy, showing an 8.64% difference. Notably, the
single-call greedy sampling strategy with GPT-4 outperformed other approaches.
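The six strategies can be read as a small design space: zero-shot or few-shot examples, a CoT cue, and optional item stem and rubric context, with the final score coming either from a single greedy call or from majority voting over several nucleus-sampled calls. A minimal sketch of that setup (hypothetical helper names; the actual model call is omitted):

```python
from collections import Counter

def build_prompt(item_stem, rubric, examples, response, use_cot=True):
    """Assemble a scoring prompt: optional item stem and rubric for
    context, scored examples for few-shot (empty list = zero-shot),
    and an optional CoT cue."""
    parts = []
    if item_stem:
        parts.append(f"Assessment item:\n{item_stem}")
    if rubric:
        parts.append(f"Scoring rubric:\n{rubric}")
    for ex_response, ex_score in examples:
        parts.append(f"Student response: {ex_response}\nScore: {ex_score}")
    parts.append(f"Student response: {response}")
    if use_cot:
        parts.append("Let's think step by step, then give a final score.")
    return "\n\n".join(parts)

def ensemble_vote(sampled_scores):
    """Majority vote over score labels parsed from several
    nucleus-sampled (temperature > 0, top-p) model calls."""
    return Counter(sampled_scores).most_common(1)[0][0]
```

In these terms, the single-call greedy baseline is one model call at temperature 0, while the ensemble-voting strategy runs several sampled calls and returns `ensemble_vote` over the parsed labels.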
Related papers
- Networks of Networks: Complexity Class Principles Applied to Compound AI Systems Design [63.24275274981911]
Compound AI Systems consisting of many language model inference calls are increasingly employed.
In this work, we construct systems, which we call Networks of Networks (NoNs) organized around the distinction between generating a proposed answer and verifying its correctness.
We introduce a verifier-based judge NoN with K generators, an instantiation of "best-of-K" or "judge-based" compound AI systems.
arXiv Detail & Related papers (2024-07-23T20:40:37Z)
- Advancing LLM Reasoning Generalists with Preference Trees [119.57169648859707]
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning.
Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks.
arXiv Detail & Related papers (2024-04-02T16:25:30Z)
- The Benefits of a Concise Chain of Thought on Problem-Solving in Large Language Models [0.0]
CCoT reduced average response length by 48.70% for both GPT-3.5 and GPT-4 while having a negligible impact on problem-solving performance.
Overall, CCoT leads to an average per-token cost reduction of 22.67%.
arXiv Detail & Related papers (2024-01-11T01:52:25Z)
- Using GPT-4 to Augment Unbalanced Data for Automatic Scoring [0.6278186810520364]
We introduce a novel text data augmentation framework using GPT-4, a generative large language model.
We crafted prompts for GPT-4 to generate responses resembling student-written answers, particularly for minority scoring classes.
We fine-tuned DistilBERT for automatic scoring on the augmented and original datasets.
arXiv Detail & Related papers (2023-10-25T01:07:50Z)
- Fine-tuning ChatGPT for Automatic Scoring [1.4833692070415454]
This study highlights the potential of fine-tuned ChatGPT (GPT3.5) for automatically scoring student written constructed responses.
We compare the performance of fine-tuned GPT-3.5 with that of Google's fine-tuned state-of-the-art language model, BERT.
arXiv Detail & Related papers (2023-10-16T05:09:16Z)
- Prompt Engineering or Fine Tuning: An Empirical Assessment of Large Language Models in Automated Software Engineering Tasks [8.223311621898983]
GPT-4 with conversational prompts showed drastic improvement compared to GPT-4 with automatic prompting strategies.
Fully automated prompt engineering with no human in the loop requires further study and improvement.
arXiv Detail & Related papers (2023-10-11T00:21:00Z)
- Exploring Small Language Models with Prompt-Learning Paradigm for Efficient Domain-Specific Text Classification [2.410463233396231]
Small language models (SLMs) offer significant customizability, adaptability, and cost-effectiveness for domain-specific tasks.
In few-shot settings where prompt-based model fine-tuning is possible, T5-base, a typical SLM with 220M parameters, achieves approximately 75% accuracy with limited labeled data.
In zero-shot settings with a fixed model, although GPT-3.5-turbo, equipped with around 154B parameters, garners an accuracy of 55.16%, the power of well-designed prompts becomes evident.
arXiv Detail & Related papers (2023-09-26T09:24:46Z)
- Is GPT-4 a reliable rater? Evaluating Consistency in GPT-4 Text Ratings [63.35165397320137]
This study investigates the consistency of feedback ratings generated by OpenAI's GPT-4.
The model rated responses to tasks within the Higher Education subject domain of macroeconomics in terms of their content and style.
arXiv Detail & Related papers (2023-08-03T12:47:17Z)
- ZeroSCROLLS: A Zero-Shot Benchmark for Long Text Understanding [86.08738156304224]
We introduce ZeroSCROLLS, a zero-shot benchmark for natural language understanding over long texts.
We adapt six tasks from the SCROLLS benchmark, and add four new datasets, including two novel information fusing tasks.
We find that Claude outperforms ChatGPT, and GPT-4 achieves the highest average score.
arXiv Detail & Related papers (2023-05-23T16:15:31Z)
- Faithful Chain-of-Thought Reasoning [51.21714389639417]
Chain-of-Thought (CoT) prompting boosts Language Models' (LM) performance on a gamut of reasoning tasks.
We propose Faithful CoT, a reasoning framework involving two stages: Translation and Problem Solving.
This guarantees that the reasoning chain provides a faithful explanation of the final answer.
arXiv Detail & Related papers (2023-01-31T03:04:26Z)
- News Summarization and Evaluation in the Era of GPT-3 [73.48220043216087]
We study how GPT-3 compares against fine-tuned models trained on large summarization datasets.
We show that not only do humans overwhelmingly prefer GPT-3 summaries, prompted using only a task description, but these also do not suffer from common dataset-specific issues such as poor factuality.
arXiv Detail & Related papers (2022-09-26T01:04:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.