Empirical Study of Large Language Models as Automated Essay Scoring
Tools in English Composition: Taking TOEFL Independent Writing Task for
Example
- URL: http://arxiv.org/abs/2401.03401v1
- Date: Sun, 7 Jan 2024 07:13:50 GMT
- Title: Empirical Study of Large Language Models as Automated Essay Scoring
Tools in English Composition: Taking TOEFL Independent Writing Task for
Example
- Authors: Wei Xia, Shaoguang Mao, Chanjing Zheng
- Abstract summary: This study aims to assess the capabilities and constraints of ChatGPT, a prominent representative of large language models.
It employs ChatGPT to conduct an automated evaluation of English essays, even with a small sample size.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models have demonstrated exceptional capabilities in tasks
involving natural language generation, reasoning, and comprehension. This study
aims to construct prompts and comments grounded in the diverse scoring criteria
delineated within the official TOEFL guide. The primary objective is to assess
the capabilities and constraints of ChatGPT, a prominent representative of
large language models, within the context of automated essay scoring. The
prevailing methodologies for automated essay scoring involve the utilization of
deep neural networks, statistical machine learning techniques, and fine-tuning
pre-trained models. However, these techniques face challenges when applied to
different contexts or subjects, primarily due to their substantial data
requirements and limited adaptability to small sample sizes. In contrast, this
study employs ChatGPT to conduct an automated evaluation of English essays,
even with a small sample size, employing an experimental approach. The
empirical findings indicate that ChatGPT can provide operational functionality
for automated essay scoring, although the results exhibit a regression effect.
It is imperative to underscore that the effective design and implementation of
ChatGPT prompts necessitate profound domain expertise and technical
proficiency, as these prompts are subject to specific threshold criteria.
Keywords: ChatGPT, Automated Essay Scoring, Prompt Learning, TOEFL Independent
Writing Task
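As context for the prompt-based scoring approach the abstract describes, the sketch below shows how a rubric-grounded scoring prompt might be assembled and sent to a chat model. It is a minimal illustration assuming the OpenAI Python client; the abbreviated rubric text, model name, and parsing logic are placeholders for illustration, not the authors' actual prompts.

```python
# Minimal sketch of rubric-grounded automated essay scoring with a chat model.
# Assumptions: the OpenAI Python client is installed and OPENAI_API_KEY is set;
# the rubric excerpt and model name below are illustrative placeholders, not
# the prompts used in the paper.
import re

from openai import OpenAI

# Abbreviated stand-in for the official TOEFL independent writing rubric (0-5).
RUBRIC = """Score 5: effectively addresses the topic; well organized; ...
Score 3: addresses the topic with somewhat developed ideas; ...
Score 1: serious disorganization; little or no detail; ..."""

def score_essay(essay: str, model: str = "gpt-3.5-turbo") -> int:
    """Ask the model for a 0-5 holistic score grounded in the rubric."""
    client = OpenAI()
    prompt = (
        "You are an experienced TOEFL rater. Using the scoring criteria "
        f"below, assign an integer score from 0 to 5.\n\n{RUBRIC}\n\n"
        f"Essay:\n{essay}\n\nReply with the score only."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output makes repeated scores comparable
    )
    match = re.search(r"\d+", response.choices[0].message.content)
    if match is None:
        raise ValueError("model reply contained no score")
    return int(match.group())
```

Setting the temperature to 0 keeps repeated runs stable, which matters when probing the regression effect the abstract reports; the paper's actual prompts are grounded in the full official TOEFL rubric rather than the abbreviated stand-in above.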
Related papers
- Automatic Generation of Behavioral Test Cases For Natural Language Processing Using Clustering and Prompting [6.938766764201549]
This paper introduces an automated approach to develop test cases by exploiting the power of large language models and statistical techniques.
We analyze the behavioral test profiles across four different classification algorithms and discuss the limitations and strengths of those models.
arXiv Detail & Related papers (2024-07-31T21:12:21Z)
- CELA: Cost-Efficient Language Model Alignment for CTR Prediction [71.85120354973073]
Click-Through Rate (CTR) prediction holds a paramount position in recommender systems.
Recent efforts have sought to mitigate these challenges by integrating Pre-trained Language Models (PLMs).
We propose Cost-Efficient Language Model Alignment (CELA) for CTR prediction.
arXiv Detail & Related papers (2024-05-17T07:43:25Z)
- Enhancing Essay Scoring with Adversarial Weights Perturbation and Metric-specific AttentionPooling [18.182517741584707]
This study explores the application of BERT-related techniques to enhance the assessment of ELLs' writing proficiency.
To address the specific needs of ELLs, we propose the use of DeBERTa, a state-of-the-art neural language model.
arXiv Detail & Related papers (2024-01-06T06:05:12Z)
- Investigating the Efficacy of Large Language Models in Reflective Assessment Methods through Chain of Thoughts Prompting [0.2552922646705803]
The Chain of Thought (CoT) prompting method has been proposed as a means to enhance LLMs' proficiency in complex reasoning tasks.
The primary aim of this research is to assess how well four language models can grade reflective essays of third-year medical students.
arXiv Detail & Related papers (2023-09-30T06:25:27Z)
- Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
- MetricPrompt: Prompting Model as a Relevance Metric for Few-shot Text Classification [65.51149771074944]
MetricPrompt eases verbalizer design difficulty by reformulating the few-shot text classification task as a text pair relevance estimation task.
We conduct experiments on three widely used text classification datasets across four few-shot settings.
Results show that MetricPrompt outperforms manual verbalizer and other automatic verbalizer design methods across all few-shot settings.
arXiv Detail & Related papers (2023-06-15T06:51:35Z)
- Distilling ChatGPT for Explainable Automated Student Answer Assessment [19.604476650824516]
We introduce a novel framework that explores using ChatGPT, a cutting-edge large language model, for the concurrent tasks of student answer scoring and rationale generation.
Our experiments show that the proposed method improves the overall quadratic weighted kappa (QWK) score by 11% compared to ChatGPT (a short QWK computation sketch follows this entry).
arXiv Detail & Related papers (2023-05-22T12:11:39Z)
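QWK, the agreement metric cited in the entry above, weights disagreements between model scores and human scores by the square of their distance. A minimal sketch using scikit-learn's cohen_kappa_score; the score vectors are made up for illustration:

```python
# Quadratic weighted kappa (QWK): agreement between two raters in which larger
# score disagreements are penalized quadratically. Requires scikit-learn.
from sklearn.metrics import cohen_kappa_score

# Hypothetical example: human vs. model scores on a 0-5 essay scale.
human = [4, 3, 5, 2, 4, 3, 1, 5]
model = [4, 3, 4, 2, 5, 3, 2, 5]

qwk = cohen_kappa_score(human, model, weights="quadratic")
print(f"QWK = {qwk:.3f}")  # 1.0 is perfect agreement, 0.0 is chance level
```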
- Analyzing the Limits of Self-Supervision in Handling Bias in Language [52.26068057260399]
We evaluate how well language models capture the semantics of four tasks for bias: diagnosis, identification, extraction and rephrasing.
Our analyses indicate that language models are capable of performing these tasks to widely varying degrees across different bias dimensions, such as gender and political affiliation.
arXiv Detail & Related papers (2021-12-16T05:36:08Z)
- AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models.
Our results indicate that autoscoring models, despite getting trained as "end-to-end" models, behave like bag-of-words models.
We propose detection-based protection models that can detect oversensitivity- and overstability-causing samples with high accuracy.
arXiv Detail & Related papers (2021-09-24T03:49:38Z)
- Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm [0.0]
We discuss methods of prompt programming, emphasizing the usefulness of considering prompts through the lens of natural language.
We introduce the idea of a metaprompt that seeds the model to generate its own natural language prompts for a range of tasks.
arXiv Detail & Related papers (2021-02-15T05:27:55Z)
- My Teacher Thinks The World Is Flat! Interpreting Automatic Essay Scoring Mechanism [71.34160809068996]
Recent work shows that automated scoring systems are prone to even common-sense adversarial samples.
We utilize recent advances in interpretability to find the extent to which features such as coherence, content and relevance are important for automated scoring mechanisms.
We also find that since the models are not semantically grounded with world-knowledge and common sense, adding false facts such as "the world is flat" actually increases the score instead of decreasing it.
arXiv Detail & Related papers (2020-12-27T06:19:20Z)