Applying Large Language Models and Chain-of-Thought for Automatic
Scoring
- URL: http://arxiv.org/abs/2312.03748v2
- Date: Fri, 16 Feb 2024 19:47:48 GMT
- Title: Applying Large Language Models and Chain-of-Thought for Automatic
Scoring
- Authors: Gyeong-Geon Lee, Ehsan Latif, Xuansheng Wu, Ninghao Liu, and Xiaoming
Zhai
- Abstract summary: This study investigates the application of large language models (LLMs) in the automatic scoring of student-written responses to science assessments.
We focused on overcoming the challenges of accessibility, technical complexity, and lack of explainability that have previously limited the use of artificial intelligence-based automatic scoring tools.
- Score: 23.076596289069506
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This study investigates the application of large language models (LLMs),
specifically GPT-3.5 and GPT-4, with Chain-of-Thought (CoT) in the automatic
scoring of student-written responses to science assessments. We focused on
overcoming the challenges of accessibility, technical complexity, and lack of
explainability that have previously limited the use of artificial
intelligence-based automatic scoring tools among researchers and educators.
With a testing dataset comprising six assessment tasks (three binomial and
three trinomial) with 1,650 student responses, we employed six prompt
engineering strategies to automatically score student responses. The six
strategies combined zero-shot or few-shot learning with CoT, either alone or
alongside item stem and scoring rubrics. Results indicated that few-shot (acc =
.67) outperformed zero-shot learning (acc = .60), a 12.6% increase. CoT,
when used without item stem and scoring rubrics, did not significantly affect
scoring accuracy (acc = .60). However, CoT prompting paired with contextual
item stems and rubrics proved to be a significant contributor to scoring
accuracy (13.44% increase for zero-shot; 3.7% increase for few-shot). We found
a more balanced accuracy across different proficiency categories when CoT was
used with a scoring rubric, highlighting the importance of domain-specific
reasoning in enhancing the effectiveness of LLMs in scoring tasks. We also
found that GPT-4 demonstrated superior performance over GPT-3.5 in various
scoring tasks when combined with either the single-call greedy sampling or the
ensemble-voting nucleus sampling strategy, showing an 8.64% difference. Notably, the
single-call greedy sampling strategy with GPT-4 outperformed other approaches.
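The six strategies can be read as a small design space: zero-shot or few-shot examples, a CoT cue, and optional item stem and rubric context, with the final score coming either from a single greedy call or from majority voting over several nucleus-sampled calls. A minimal sketch of that setup (hypothetical helper names; the actual model call is omitted):

```python
from collections import Counter

def build_prompt(item_stem, rubric, examples, response, use_cot=True):
    """Assemble a scoring prompt: optional item stem and rubric for
    context, scored examples for few-shot (empty list = zero-shot),
    and an optional CoT cue."""
    parts = []
    if item_stem:
        parts.append(f"Assessment item:\n{item_stem}")
    if rubric:
        parts.append(f"Scoring rubric:\n{rubric}")
    for ex_response, ex_score in examples:
        parts.append(f"Student response: {ex_response}\nScore: {ex_score}")
    parts.append(f"Student response: {response}")
    if use_cot:
        parts.append("Let's think step by step, then give a final score.")
    return "\n\n".join(parts)

def ensemble_vote(sampled_scores):
    """Majority vote over score labels parsed from several
    nucleus-sampled (temperature > 0, top-p) model calls."""
    return Counter(sampled_scores).most_common(1)[0][0]
```

In these terms, the single-call greedy baseline is one model call at temperature 0, while the ensemble-voting strategy runs several sampled calls and returns `ensemble_vote` over the parsed labels.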
Related papers
- Networks of Networks: Complexity Class Principles Applied to Compound AI Systems Design [63.24275274981911]
Compound AI Systems consisting of many language model inference calls are increasingly employed.
In this work, we construct systems, which we call Networks of Networks (NoNs) organized around the distinction between generating a proposed answer and verifying its correctness.
We introduce a verifier-based judge NoN with K generators, an instantiation of "best-of-K" or "judge-based" compound AI systems.
arXiv Detail & Related papers (2024-07-23T20:40:37Z)
- Advancing LLM Reasoning Generalists with Preference Trees [119.57169648859707]
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning.
Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks.
arXiv Detail & Related papers (2024-04-02T16:25:30Z)
- The Benefits of a Concise Chain of Thought on Problem-Solving in Large Language Models [0.0]
CCoT reduced average response length by 48.70% for both GPT-3.5 and GPT-4 while having a negligible impact on problem-solving performance.
Overall, CCoT leads to an average per-token cost reduction of 22.67%.
arXiv Detail & Related papers (2024-01-11T01:52:25Z)
- Using GPT-4 to Augment Unbalanced Data for Automatic Scoring [0.6278186810520364]
We introduce a novel text data augmentation framework using GPT-4, a generative large language model.
We crafted prompts for GPT-4 to generate responses resembling student-written answers, particularly for minority scoring classes.
We fine-tuned DistilBERT for automatic scoring on the augmented and original datasets.
arXiv Detail & Related papers (2023-10-25T01:07:50Z)
- Fine-tuning ChatGPT for Automatic Scoring [1.4833692070415454]
This study highlights the potential of fine-tuned ChatGPT (GPT3.5) for automatically scoring student written constructed responses.
We compare the performance of fine-tuned GPT-3.5 with that of Google's fine-tuned state-of-the-art language model, BERT.
arXiv Detail & Related papers (2023-10-16T05:09:16Z)
- Prompt Engineering or Fine Tuning: An Empirical Assessment of Large Language Models in Automated Software Engineering Tasks [8.223311621898983]
GPT-4 with conversational prompts showed drastic improvement compared to GPT-4 with automatic prompting strategies.
Fully automated prompt engineering with no human in the loop requires further study and improvement.
arXiv Detail & Related papers (2023-10-11T00:21:00Z)
- Exploring Small Language Models with Prompt-Learning Paradigm for Efficient Domain-Specific Text Classification [2.410463233396231]
Small language models (SLMs) offer significant customizability, adaptability, and cost-effectiveness for domain-specific tasks.
In few-shot settings where prompt-based model fine-tuning is possible, T5-base, a typical SLM with 220M parameters, achieves approximately 75% accuracy with limited labeled data.
In zero-shot settings with a fixed model, although GPT-3.5-turbo, equipped with around 154B parameters, garners an accuracy of 55.16%, the power of well-designed prompts becomes evident.
arXiv Detail & Related papers (2023-09-26T09:24:46Z)
- Is GPT-4 a reliable rater? Evaluating Consistency in GPT-4 Text Ratings [63.35165397320137]
This study investigates the consistency of feedback ratings generated by OpenAI's GPT-4.
The model rated responses to tasks within the Higher Education subject domain of macroeconomics in terms of their content and style.
arXiv Detail & Related papers (2023-08-03T12:47:17Z)
- ZeroSCROLLS: A Zero-Shot Benchmark for Long Text Understanding [86.08738156304224]
We introduce ZeroSCROLLS, a zero-shot benchmark for natural language understanding over long texts.
We adapt six tasks from the SCROLLS benchmark, and add four new datasets, including two novel information fusing tasks.
We find that Claude outperforms ChatGPT, and GPT-4 achieves the highest average score.
arXiv Detail & Related papers (2023-05-23T16:15:31Z)
- Faithful Chain-of-Thought Reasoning [51.21714389639417]
Chain-of-Thought (CoT) prompting boosts Language Models' (LM) performance on a gamut of reasoning tasks.
We propose Faithful CoT, a reasoning framework involving two stages: Translation and Problem Solving.
This guarantees that the reasoning chain provides a faithful explanation of the final answer.
arXiv Detail & Related papers (2023-01-31T03:04:26Z)
- News Summarization and Evaluation in the Era of GPT-3 [73.48220043216087]
We study how GPT-3 compares against fine-tuned models trained on large summarization datasets.
We show that not only do humans overwhelmingly prefer GPT-3 summaries, prompted using only a task description, but these also do not suffer from common dataset-specific issues such as poor factuality.
arXiv Detail & Related papers (2022-09-26T01:04:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.