Applying Large Language Models and Chain-of-Thought for Automatic
Scoring
- URL: http://arxiv.org/abs/2312.03748v2
- Date: Fri, 16 Feb 2024 19:47:48 GMT
- Title: Applying Large Language Models and Chain-of-Thought for Automatic
Scoring
- Authors: Gyeong-Geon Lee, Ehsan Latif, Xuansheng Wu, Ninghao Liu, and Xiaoming
Zhai
- Abstract summary: This study investigates the application of large language models (LLMs) in the automatic scoring of student-written responses to science assessments.
We focused on overcoming the challenges of accessibility, technical complexity, and lack of explainability that have previously limited the use of artificial intelligence-based automatic scoring tools.
- Score: 23.076596289069506
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This study investigates the application of large language models (LLMs),
specifically GPT-3.5 and GPT-4, with Chain-of-Thought (CoT) in the automatic
scoring of student-written responses to science assessments. We focused on
overcoming the challenges of accessibility, technical complexity, and lack of
explainability that have previously limited the use of artificial
intelligence-based automatic scoring tools among researchers and educators.
With a testing dataset comprising six assessment tasks (three binomial and
three trinomial) with 1,650 student responses, we employed six prompt
engineering strategies to automatically score student responses. The six
strategies combined zero-shot or few-shot learning with CoT, either alone or
alongside item stem and scoring rubrics. Results indicated that few-shot learning (acc =
.67) outperformed zero-shot learning (acc = .60), a 12.6% increase. CoT,
when used without item stem and scoring rubrics, did not significantly affect
scoring accuracy (acc = .60). However, CoT prompting paired with contextual
item stems and rubrics proved to be a significant contributor to scoring
accuracy (13.44% increase for zero-shot; 3.7% increase for few-shot). We found
a more balanced accuracy across different proficiency categories when CoT was
used with a scoring rubric, highlighting the importance of domain-specific
reasoning in enhancing the effectiveness of LLMs in scoring tasks. We also
found that GPT-4 demonstrated superior performance over GPT-3.5 in various
scoring tasks when combined with either the single-call greedy sampling or the
ensemble-voting nucleus sampling strategy, showing an 8.64% difference. In particular, the
single-call greedy sampling strategy with GPT-4 outperformed other approaches.
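The scoring pipeline described in the abstract has two parts: a prompt that combines few-shot examples, CoT instructions, the item stem, and the scoring rubric; and a sampling strategy, either a single greedy call or ensemble voting over several nucleus-sampled calls. The sketch below assumes the OpenAI Python SDK; the item stem, rubric text, example responses, and helper names are illustrative assumptions, not the authors' released materials.

```python
# A minimal sketch (not the authors' released code) of the two pieces the
# abstract describes:
#   1. a few-shot Chain-of-Thought scoring prompt that includes the item stem
#      and scoring rubric, and
#   2. the two sampling strategies compared: a single greedy call versus
#      ensemble voting over nucleus-sampled calls.
# The item stem, rubric, example responses, and helper names are illustrative
# assumptions; only the OpenAI chat-completions calls are real.
from collections import Counter

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ITEM_STEM = "Explain why an ice cube melts faster on a metal plate than on a plastic plate."
RUBRIC = (
    "Score 2 (Proficient): explains heat transfer via the higher thermal conductivity of metal.\n"
    "Score 1 (Developing): mentions heat or temperature but not conductivity.\n"
    "Score 0 (Beginning): no scientifically relevant mechanism."
)
FEW_SHOT_EXAMPLES = [
    (
        "Metal conducts heat from the surroundings into the ice much faster than plastic.",
        "Names conduction and energy transfer into the ice, matching the Score 2 descriptor.",
        2,
    ),
    (
        "The metal plate is just colder, so the ice melts.",
        "No correct mechanism; conductivity is not mentioned.",
        0,
    ),
]


def build_messages(student_response: str) -> list[dict]:
    """Assemble a few-shot CoT prompt that carries the item stem and rubric."""
    system = (
        "You are a science assessment scorer. Reason step by step about how the "
        "response meets the rubric, then end with a line 'Score: <0|1|2>'."
    )
    shots = "\n\n".join(
        f"Response: {text}\nReasoning: {reasoning}\nScore: {score}"
        for text, reasoning, score in FEW_SHOT_EXAMPLES
    )
    user = (
        f"Item stem:\n{ITEM_STEM}\n\nScoring rubric:\n{RUBRIC}\n\n"
        f"Scored examples:\n{shots}\n\nNow score this response:\n{student_response}"
    )
    return [{"role": "system", "content": system}, {"role": "user", "content": user}]


def parse_score(completion_text: str) -> int:
    """Extract the first digit after the final 'Score:' marker."""
    tail = completion_text.rsplit("Score:", 1)[-1]
    digits = "".join(ch for ch in tail if ch.isdigit())
    return int(digits[0])


def score_greedy(student_response: str, model: str = "gpt-4") -> int:
    """Single-call greedy strategy: one temperature-0 completion."""
    resp = client.chat.completions.create(
        model=model, messages=build_messages(student_response), temperature=0
    )
    return parse_score(resp.choices[0].message.content)


def score_ensemble(student_response: str, model: str = "gpt-4", k: int = 5) -> int:
    """Ensemble-voting strategy: k nucleus-sampled completions, majority vote."""
    resp = client.chat.completions.create(
        model=model,
        messages=build_messages(student_response),
        temperature=1.0,
        top_p=0.9,  # nucleus sampling
        n=k,
    )
    votes = [parse_score(choice.message.content) for choice in resp.choices]
    return Counter(votes).most_common(1)[0][0]
```

For example, score_greedy("Metal moves heat into the ice faster than plastic does.") returns the parsed score from one temperature-0 call, while score_ensemble takes the majority vote over k nucleus-sampled calls.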
Related papers
- Towards Scalable Automated Grading: Leveraging Large Language Models for Conceptual Question Evaluation in Engineering [5.160473221022088]
This study explores the feasibility of using large language models (LLMs) for automated grading of conceptual questions.
We compared the grading performance of GPT-4o with that of human teaching assistants (TAs) on ten quiz problems from the MEEN 361 course at Texas A&M University.
Our analysis reveals that GPT-4o performs well when grading criteria are straightforward but struggles with nuanced answers.
arXiv Detail & Related papers (2024-11-06T04:41:13Z) - Evaluating Large Language Models on Spatial Tasks: A Multi-Task Benchmarking Study [4.80612909282198]
This study introduces a novel multi-task spatial evaluation dataset.
The dataset encompasses twelve distinct task types, including spatial understanding and path planning.
The study highlights the impact of prompt strategies on model performance in specific tasks.
arXiv Detail & Related papers (2024-08-26T17:25:16Z) - Networks of Networks: Complexity Class Principles Applied to Compound AI Systems Design [63.24275274981911]
Compound AI Systems consisting of many language model inference calls are increasingly employed.
In this work, we construct systems, which we call Networks of Networks (NoNs) organized around the distinction between generating a proposed answer and verifying its correctness.
We introduce a verifier-based judge NoN with K generators, an instantiation of "best-of-K" or "judge-based" compound AI systems (see the sketch after this list).
arXiv Detail & Related papers (2024-07-23T20:40:37Z) - Advancing LLM Reasoning Generalists with Preference Trees [119.57169648859707]
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning.
Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks.
arXiv Detail & Related papers (2024-04-02T16:25:30Z) - Fine-tuning ChatGPT for Automatic Scoring [1.4833692070415454]
This study highlights the potential of fine-tuned ChatGPT (GPT3.5) for automatically scoring student written constructed responses.
We compare the performance of fine-tuned GPT-3.5 with that of a fine-tuned state-of-the-art Google language model, BERT.
arXiv Detail & Related papers (2023-10-16T05:09:16Z) - Prompt Engineering or Fine Tuning: An Empirical Assessment of Large
Language Models in Automated Software Engineering Tasks [8.223311621898983]
GPT-4 with conversational prompts showed a drastic improvement over GPT-4 with automated prompting strategies.
Fully automated prompt engineering with no human in the loop requires further study and improvement.
arXiv Detail & Related papers (2023-10-11T00:21:00Z) - Is GPT-4 a reliable rater? Evaluating Consistency in GPT-4 Text Ratings [63.35165397320137]
This study investigates the consistency of feedback ratings generated by OpenAI's GPT-4.
The model rated responses to tasks within the Higher Education subject domain of macroeconomics in terms of their content and style.
arXiv Detail & Related papers (2023-08-03T12:47:17Z) - ZeroSCROLLS: A Zero-Shot Benchmark for Long Text Understanding [86.08738156304224]
We introduce ZeroSCROLLS, a zero-shot benchmark for natural language understanding over long texts.
We adapt six tasks from the SCROLLS benchmark, and add four new datasets, including two novel information fusing tasks.
We find that Claude outperforms ChatGPT, and GPT-4 achieves the highest average score.
arXiv Detail & Related papers (2023-05-23T16:15:31Z) - Hint of Thought prompting: an explainable and zero-shot approach to reasoning tasks with LLMs [5.996787847938559]
This paper proposes a novel hint of thought (HoT) prompting with explainability and zero-shot generalization.
It is decomposed into three steps: explainable sub-questions, logical reasoning, and answering.
Experiments show that our HoT prompting has a significant advantage on the zero-shot reasoning task compared to existing zero-shot CoT.
arXiv Detail & Related papers (2023-05-19T06:30:17Z) - Progressive-Hint Prompting Improves Reasoning in Large Language Models [63.98629132836499]
This paper proposes a new prompting method, named Progressive-Hint Prompting (PHP).
It enables automatic multiple interactions between users and Large Language Models (LLMs) by using previously generated answers as hints to progressively guide toward the correct answers.
We conducted extensive and comprehensive experiments on seven benchmarks. The results show that PHP significantly improves accuracy while remaining highly efficient.
arXiv Detail & Related papers (2023-04-19T16:29:48Z) - Faithful Chain-of-Thought Reasoning [51.21714389639417]
Chain-of-Thought (CoT) prompting boosts Language Models' (LM) performance on a gamut of reasoning tasks.
We propose Faithful CoT, a reasoning framework involving two stages: Translation and Problem Solving.
This guarantees that the reasoning chain provides a faithful explanation of the final answer.
arXiv Detail & Related papers (2023-01-31T03:04:26Z)
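As referenced in the Networks of Networks entry above, the "best-of-K" / judge-based pattern can be sketched in a few lines: K generator calls propose answers and a verifier call judges which one is correct. The prompts and the best_of_k helper below are hypothetical stand-ins under the same OpenAI-SDK assumption as the earlier sketch, not the paper's implementation.

```python
# A rough sketch of a verifier-based judge over K generators ("best-of-K").
# Prompt wording and the best_of_k helper are hypothetical stand-ins, not the
# Networks of Networks paper's implementation.
from openai import OpenAI

client = OpenAI()


def _ask(prompt: str, temperature: float) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp.choices[0].message.content


def best_of_k(question: str, k: int = 3) -> str:
    """Generate K candidate answers, then let a verifier judge pick one."""
    candidates = [
        _ask(f"Answer the following question:\n{question}", temperature=0.7)
        for _ in range(k)
    ]
    numbered = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
    verdict = _ask(
        "You are a verifier. Reply with only the index of the candidate answer "
        f"that is correct.\n\nQuestion: {question}\n\nCandidates:\n{numbered}",
        temperature=0,
    )
    digits = "".join(ch for ch in verdict if ch.isdigit())
    index = int(digits) if digits else 0
    return candidates[min(index, k - 1)]
```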