Using ChatGPT to Score Essays and Short-Form Constructed Responses
- URL: http://arxiv.org/abs/2408.09540v1
- Date: Sun, 18 Aug 2024 16:51:28 GMT
- Title: Using ChatGPT to Score Essays and Short-Form Constructed Responses
- Authors: Mark D. Shermis
- Abstract summary: The investigation focused on various prediction models, including linear regression, random forest, gradient boost, and boost.
ChatGPT's performance was evaluated against human raters using quadratic weighted kappa (QWK) metrics.
Study concludes that ChatGPT can complement human scoring but requires additional development to be reliable for high-stakes assessments.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study aimed to determine if ChatGPT's large language models could match the scoring accuracy of human and machine scores from the ASAP competition. The investigation focused on various prediction models, including linear regression, random forest, gradient boost, and boost. ChatGPT's performance was evaluated against human raters using quadratic weighted kappa (QWK) metrics. Results indicated that while ChatGPT's gradient boost model achieved QWKs close to human raters for some data sets, its overall performance was inconsistent and often lower than human scores. The study highlighted the need for further refinement, particularly in handling biases and ensuring scoring fairness. Despite these challenges, ChatGPT demonstrated potential for scoring efficiency, especially with domain-specific fine-tuning. The study concludes that ChatGPT can complement human scoring but requires additional development to be reliable for high-stakes assessments. Future research should improve model accuracy, address ethical considerations, and explore hybrid models combining ChatGPT with empirical methods.
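The headline agreement metric here, quadratic weighted kappa (QWK), is straightforward to compute with standard tooling. A minimal sketch using scikit-learn, not the paper's actual pipeline; the score arrays are invented stand-ins for human ratings and ChatGPT-predicted ratings on the same essays:

```python
# Minimal sketch: quadratic weighted kappa (QWK) between two raters,
# computed with scikit-learn. The scores below are invented for illustration.
from sklearn.metrics import cohen_kappa_score

human_scores = [2, 3, 4, 2, 1, 4, 3, 3, 2, 4]   # human rater, e.g. a 1-4 rubric
model_scores = [2, 3, 3, 2, 2, 4, 3, 4, 2, 4]   # model-predicted scores

# weights="quadratic" penalizes large disagreements more than near-misses,
# which is why QWK is the standard agreement metric in essay scoring.
qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(f"QWK: {qwk:.3f}")
```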
Related papers
- Is ChatGPT the Future of Causal Text Mining? A Comprehensive Evaluation and Analysis [8.031131164056347]
This study conducts comprehensive evaluations of ChatGPT's causal text mining capabilities.
We introduce a benchmark that extends beyond general English datasets.
We also provide an evaluation framework to ensure fair comparisons between ChatGPT and previous approaches.
arXiv Detail & Related papers (2024-02-22T12:19:04Z)
- ChatGPT for Suicide Risk Assessment on Social Media: Quantitative Evaluation of Model Performance, Potentials and Limitations [5.8762433393846045]
This paper presents a framework for evaluating the interactive ChatGPT model in the context of suicidality assessment from social media posts.
We conduct a technical evaluation of ChatGPT's performance on this task using Zero-Shot and Few-Shot experiments.
Our results indicate that while ChatGPT attains considerable accuracy in this task, transformer-based models fine-tuned on human-annotated datasets exhibit superior performance.
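For illustration, the zero-shot/few-shot distinction evaluated here comes down to whether worked examples are prepended to the prompt. A minimal sketch with an invented, generic classification task; these are not the paper's actual prompts:

```python
# Minimal sketch of zero-shot vs. few-shot prompting. The task wording and
# demonstrations are invented for illustration, not the paper's prompts.

def zero_shot_prompt(text: str) -> str:
    return f"Classify the sentiment of this post as POSITIVE or NEGATIVE.\nPost: {text}\nLabel:"

def few_shot_prompt(text: str, demos: list[tuple[str, str]]) -> str:
    shots = "\n".join(f"Post: {p}\nLabel: {label}" for p, label in demos)
    return ("Classify the sentiment of each post as POSITIVE or NEGATIVE.\n"
            f"{shots}\nPost: {text}\nLabel:")

demos = [("Loved every minute of it.", "POSITIVE"),
         ("Total waste of time.", "NEGATIVE")]
print(zero_shot_prompt("The ending surprised me."))
print(few_shot_prompt("The ending surprised me.", demos))
```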
arXiv Detail & Related papers (2023-06-15T16:01:30Z)
- Large Language Models are not Fair Evaluators [60.27164804083752]
We find that the quality ranking of candidate responses can be easily hacked by altering their order of appearance in the context.
This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other.
We propose a framework with three simple yet effective strategies to mitigate this issue.
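One common mitigation in this vein is to query the judge twice with the answer order swapped and keep only verdicts that survive the swap. A minimal sketch of that idea, not the paper's exact framework; the toy judge is invented and deliberately position-biased:

```python
# Minimal sketch of order-swapped judging. `toy_judge` is an invented stand-in
# for an LLM judge; it deliberately favors whichever answer appears first,
# mimicking the positional bias reported above.

def toy_judge(question: str, first: str, second: str) -> str:
    return "first"   # stand-in: always prefers the first-listed answer

def swap_consistent_verdict(question: str, ans_a: str, ans_b: str,
                            judge=toy_judge) -> str:
    v1 = judge(question, ans_a, ans_b)   # A shown in the first slot
    v2 = judge(question, ans_b, ans_a)   # B shown in the first slot
    if v1 == "first" and v2 == "second":
        return "A"                        # A wins regardless of position
    if v1 == "second" and v2 == "first":
        return "B"                        # B wins regardless of position
    return "tie"                          # verdict flipped with the ordering

print(swap_consistent_verdict("Q?", "answer one", "answer two"))  # -> tie
```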
arXiv Detail & Related papers (2023-05-29T07:41:03Z)
- Distilling ChatGPT for Explainable Automated Student Answer Assessment [19.604476650824516]
We introduce a novel framework that explores using ChatGPT, a cutting-edge large language model, for the concurrent tasks of student answer scoring and rationale generation.
Our experiments show that the proposed method improves the overall QWK score by 11% compared to ChatGPT.
arXiv Detail & Related papers (2023-05-22T12:11:39Z)
- To ChatGPT, or not to ChatGPT: That is the question! [78.407861566006]
This study provides a comprehensive and contemporary assessment of the most recent techniques in ChatGPT detection.
We have curated a benchmark dataset consisting of prompts from ChatGPT and humans, including diverse questions from medical, open Q&A, and finance domains.
Our evaluation results demonstrate that none of the existing methods can effectively detect ChatGPT-generated content.
arXiv Detail & Related papers (2023-04-04T03:04:28Z)
- Exploring ChatGPT's Ability to Rank Content: A Preliminary Study on Consistency with Human Preferences [6.821378903525802]
ChatGPT has consistently demonstrated a remarkable level of accuracy and reliability in terms of content evaluation.
A test set consisting of prompts is created, covering a wide range of use cases, and five models are utilized to generate corresponding responses.
Results on the test set show that ChatGPT's ranking preferences are consistent with human preferences to a certain extent.
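Ranking consistency of this kind is typically quantified with a rank correlation. A minimal sketch using Kendall's tau from SciPy; the rankings are invented for illustration:

```python
# Minimal sketch: quantify agreement between two rankings of the same five
# responses with Kendall's tau. The rankings are invented for illustration.
from scipy.stats import kendalltau

human_order   = [1, 2, 3, 4, 5]   # human rank of five responses (1 = best)
chatgpt_order = [2, 1, 3, 4, 5]   # ChatGPT's rank of the same responses

tau, p_value = kendalltau(human_order, chatgpt_order)
print(f"Kendall tau = {tau:.2f}")  # 1.0 = identical order, -1.0 = reversed
```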
arXiv Detail & Related papers (2023-03-14T03:13:02Z)
- Consistency Analysis of ChatGPT [65.268245109828]
This paper investigates the trustworthiness of ChatGPT and GPT-4 regarding logically consistent behaviour.
Our findings suggest that while both models appear to show an enhanced language understanding and reasoning ability, they still frequently fall short of generating logically consistent predictions.
arXiv Detail & Related papers (2023-03-11T01:19:01Z)
- Is ChatGPT a Good NLG Evaluator? A Preliminary Study [121.77986688862302]
We provide a preliminary meta-evaluation on ChatGPT to show its reliability as an NLG metric.
Experimental results show that compared with previous automatic metrics, ChatGPT achieves state-of-the-art or competitive correlation with human judgments.
We hope our preliminary study will prompt the emergence of a general-purpose, reliable NLG metric.
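A meta-evaluation of this kind boils down to correlating a metric's scores with human judgments over the same outputs. A minimal sketch with invented numbers, not the paper's data:

```python
# Minimal sketch of a metric meta-evaluation: correlate an automatic metric's
# scores with human judgments over the same outputs. All numbers are invented.
from scipy.stats import spearmanr

human_judgments = [4.5, 2.0, 3.5, 1.0, 5.0, 3.0]        # 1-5 quality ratings
metric_scores   = [0.82, 0.41, 0.69, 0.25, 0.91, 0.55]  # metric's 0-1 scores

rho, _ = spearmanr(metric_scores, human_judgments)
print(f"Spearman correlation with human judgments: {rho:.2f}")
```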
arXiv Detail & Related papers (2023-03-07T16:57:20Z)
- Can ChatGPT Understand Too? A Comparative Study on ChatGPT and Fine-tuned BERT [103.57103957631067]
ChatGPT has attracted great attention, as it can generate fluent and high-quality responses to human inquiries.
We evaluate ChatGPT's understanding ability on the popular GLUE benchmark and compare it with 4 representative fine-tuned BERT-style models.
We find that: 1) ChatGPT falls short in handling paraphrase and similarity tasks; 2) ChatGPT outperforms all BERT models on inference tasks by a large margin; 3) ChatGPT achieves performance comparable to BERT on sentiment analysis and question answering tasks.
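A GLUE-style comparison of this kind can be scored with the Hugging Face datasets library. A minimal sketch on SST-2, where `predict` is a hypothetical stand-in for either a prompted ChatGPT call or a fine-tuned BERT classifier:

```python
# Minimal sketch: score label predictions against gold labels on a GLUE task
# (SST-2) loaded via the Hugging Face datasets library. `predict` is a
# hypothetical stand-in for a prompted ChatGPT call or a fine-tuned BERT.
from datasets import load_dataset

def predict(sentence: str) -> int:
    return 1   # stand-in: always predicts the positive class

val = load_dataset("glue", "sst2", split="validation")
correct = sum(predict(ex["sentence"]) == ex["label"] for ex in val)
print(f"Accuracy: {correct / len(val):.3f}")
```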
arXiv Detail & Related papers (2023-02-19T12:29:33Z)
- Is ChatGPT a General-Purpose Natural Language Processing Task Solver? [113.22611481694825]
Large language models (LLMs) have demonstrated the ability to perform a variety of natural language processing (NLP) tasks zero-shot.
Recently, the debut of ChatGPT has drawn a great deal of attention from the natural language processing (NLP) community.
It is not yet known whether ChatGPT can serve as a generalist model that can perform many NLP tasks zero-shot.
arXiv Detail & Related papers (2023-02-08T09:44:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.