ChatGPT or Grammarly? Evaluating ChatGPT on Grammatical Error Correction Benchmark
- URL: http://arxiv.org/abs/2303.13648v1
- Date: Wed, 15 Mar 2023 00:35:50 GMT
- Title: ChatGPT or Grammarly? Evaluating ChatGPT on Grammatical Error Correction Benchmark
- Authors: Haoran Wu, Wenxuan Wang, Yuxuan Wan, Wenxiang Jiao, Michael Lyu
- Abstract summary: ChatGPT is a cutting-edge artificial intelligence language model developed by OpenAI.
We compare it with a commercial GEC product (e.g., Grammarly) and state-of-the-art models (e.g., GECToR).
We find that ChatGPT does not perform as well as those baselines in terms of the automatic evaluation metrics.
- Score: 11.36853733574956
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: ChatGPT is a cutting-edge artificial intelligence language model developed by
OpenAI, which has attracted a lot of attention due to its surprisingly strong
ability in answering follow-up questions. In this report, we aim to evaluate
ChatGPT on the Grammatical Error Correction (GEC) task and compare it with a
commercial GEC product (e.g., Grammarly) and state-of-the-art models (e.g.,
GECToR). Testing on the CoNLL-2014 benchmark dataset, we find that ChatGPT
does not perform as well as those baselines in terms of automatic evaluation
metrics (e.g., the $F_{0.5}$ score), particularly on long sentences. Inspecting
the outputs, we find that ChatGPT goes beyond one-by-one corrections:
it prefers to change the surface expression of certain phrases or the sentence
structure while maintaining grammatical correctness. Human evaluation
quantitatively confirms this and suggests that ChatGPT produces fewer
under-correction and mis-correction issues but more over-corrections. These
results demonstrate that ChatGPT is severely underestimated by the automatic
evaluation metrics and could be a promising tool for GEC.
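For context, the $F_{0.5}$ score referenced above weights precision twice as heavily as recall, following the GEC convention that proposing a wrong correction is worse than missing one. Below is a minimal Python sketch of the computation from edit counts (the variable names are illustrative, not taken from the paper's evaluation code):

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """F-beta from true-positive, false-positive, and false-negative
    edit counts, as computed by GEC scorers on CoNLL-2014."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# Example: 40 correct edits, 10 spurious edits, 50 missed edits.
# Precision = 0.80, recall ~ 0.44; F_0.5 stays close to the precision.
print(f"F_0.5 = {f_beta(40, 10, 50):.3f}")  # ~0.690
```

Because of the precision weighting, a system that rewrites aggressively (as ChatGPT does) is penalized for every edit outside the reference, which is consistent with the underestimation the authors report.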
Related papers
- Exploring the Capability of ChatGPT to Reproduce Human Labels for Social Computing Tasks (Extended Version) [26.643834593780007]
We investigate the extent to which ChatGPT can annotate data for social computing tasks.
ChatGPT exhibits promise in handling data annotation tasks, albeit with some challenges.
We propose GPT-Rater, a tool to predict if ChatGPT can correctly label data for a given annotation task.
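To make "promise with some challenges" concrete, annotation quality is often measured as agreement between ChatGPT's labels and human gold labels. A small sketch using scikit-learn (the label lists are invented for illustration; this is not the paper's GPT-Rater tool):

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Hypothetical labels from human annotators and from ChatGPT on the same items.
human_labels   = ["pos", "neg", "neg", "pos", "neutral", "pos"]
chatgpt_labels = ["pos", "neg", "pos", "pos", "neutral", "neg"]

# Raw agreement, plus chance-corrected agreement (Cohen's kappa).
print("accuracy:", accuracy_score(human_labels, chatgpt_labels))
print("kappa:   ", cohen_kappa_score(human_labels, chatgpt_labels))
```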
arXiv Detail & Related papers (2024-07-08T22:04:30Z)
- Eval-GCSC: A New Metric for Evaluating ChatGPT's Performance in Chinese Spelling Correction [60.32771192285546]
ChatGPT has demonstrated impressive performance in various downstream tasks.
In the Chinese Spelling Correction (CSC) task, we observe a discrepancy: while ChatGPT performs well under human evaluation, it scores poorly according to traditional metrics.
This paper proposes a new evaluation metric: Eval-GCSC. By incorporating word-level and semantic similarity judgments, it relaxes the stringent length and phonics constraints.
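The exact formulation of Eval-GCSC is given in the paper; purely as an illustration of the idea of blending word-level overlap with semantic similarity, here is a hypothetical combined score (the linear weighting and both inputs are assumptions, not the published metric):

```python
def combined_score(word_match: float, semantic_sim: float,
                   alpha: float = 0.5) -> float:
    """Hypothetical blend of a word-level match rate (0-1) and an
    embedding-based semantic similarity (0-1). Eval-GCSC's real
    definition may differ; this only illustrates relaxing the
    strict exact-match constraints."""
    return alpha * word_match + (1 - alpha) * semantic_sim

# A correction that rephrases but preserves meaning can still score well.
print(combined_score(word_match=0.6, semantic_sim=0.95))  # 0.775
```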
arXiv Detail & Related papers (2023-11-14T14:56:33Z)
- ChatLog: Carefully Evaluating the Evolution of ChatGPT Across Time [54.18651663847874]
ChatGPT has achieved great success and can be considered to have acquired infrastructural status.
Existing benchmarks encounter two challenges: (1) disregard for periodical evaluation and (2) lack of fine-grained features.
We construct ChatLog, an ever-updating dataset with large-scale records of diverse long-form ChatGPT responses for 21 NLP benchmarks from March 2023 to the present.
arXiv Detail & Related papers (2023-04-27T11:33:48Z)
- Evaluating ChatGPT's Information Extraction Capabilities: An Assessment of Performance, Explainability, Calibration, and Faithfulness [18.945934162722466]
We focus on assessing the overall ability of ChatGPT using 7 fine-grained information extraction (IE) tasks.
ChatGPT's performance in the Standard-IE setting is poor, but it surprisingly exhibits excellent performance in the OpenIE setting.
ChatGPT provides high-quality and trustworthy explanations for its decisions.
arXiv Detail & Related papers (2023-04-23T12:33:18Z)
- ChatGPT-Crawler: Find out if ChatGPT really knows what it's talking about [15.19126287569545]
This research examines the responses generated by ChatGPT from different Conversational QA corpora.
The study employed BERT similarity scores to compare these responses with correct answers and to obtain Natural Language Inference (NLI) labels.
The study identified instances where ChatGPT provided incorrect answers to questions, providing insights into areas where the model may be prone to error.
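The BERT-similarity comparison described above can be reproduced in spirit with sentence embeddings and cosine similarity; the sketch below uses the sentence-transformers library (the model name and the example pair are illustrative choices, not necessarily the paper's setup):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any BERT-style encoder works

chatgpt_answer = "The Eiffel Tower is located in Paris, France."
gold_answer = "It is in Paris."

# Cosine similarity between the two sentence embeddings.
embeddings = model.encode([chatgpt_answer, gold_answer], convert_to_tensor=True)
score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"similarity = {score:.3f}")  # low-scoring pairs can be flagged as likely errors
```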
arXiv Detail & Related papers (2023-04-06T18:42:47Z)
- Is ChatGPT a Highly Fluent Grammatical Error Correction System? A Comprehensive Evaluation [41.94480044074273]
ChatGPT is a large-scale language model based on the advanced GPT-3.5 architecture.
We design zero-shot chain-of-thought (CoT) and few-shot CoT settings using in-context learning for ChatGPT.
Our evaluation involves assessing ChatGPT's performance on five official test sets in three different languages, along with three document-level GEC test sets in English.
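Settings like these reduce to prompt construction; here is a minimal sketch of zero-shot versus few-shot chain-of-thought (CoT) prompting for GEC (the prompt wording is invented for illustration and is not the paper's exact template):

```python
def build_gec_prompt(sentence: str,
                     examples: list[tuple[str, str]] | None = None) -> str:
    """Zero-shot CoT prompt by default; few-shot CoT when
    (erroneous, corrected) demonstration pairs are supplied."""
    parts = ["Correct the grammatical errors in the sentence."]
    for bad, good in examples or []:
        parts.append(f"Input: {bad}\nOutput: {good}")
    parts.append(f"Input: {sentence}\n"
                 "Let's think step by step, then give the corrected sentence.")
    return "\n\n".join(parts)

# Few-shot variant with one demonstration pair.
print(build_gec_prompt("She go to school yesterday.",
                       examples=[("He have a car.", "He has a car.")]))
```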
arXiv Detail & Related papers (2023-04-04T12:33:40Z)
- To ChatGPT, or not to ChatGPT: That is the question! [78.407861566006]
This study provides a comprehensive and contemporary assessment of the most recent techniques in ChatGPT detection.
We have curated a benchmark dataset consisting of prompts from ChatGPT and humans, including diverse questions from medical, open Q&A, and finance domains.
Our evaluation results demonstrate that none of the existing methods can effectively detect ChatGPT-generated content.
arXiv Detail & Related papers (2023-04-04T03:04:28Z)
- Is ChatGPT a Good NLG Evaluator? A Preliminary Study [121.77986688862302]
We provide a preliminary meta-evaluation on ChatGPT to show its reliability as an NLG metric.
Experimental results show that compared with previous automatic metrics, ChatGPT achieves state-of-the-art or competitive correlation with human judgments.
We hope our preliminary study could prompt the emergence of a general-purpose, reliable NLG metric.
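"Correlation with human judgments" in this kind of meta-evaluation is typically reported as Spearman or Kendall correlation between metric scores and human ratings over the same outputs; a short sketch with SciPy (both score arrays are invented):

```python
from scipy.stats import kendalltau, spearmanr

# Hypothetical per-sample scores for the same generated texts.
human_ratings = [4.5, 2.0, 3.5, 5.0, 1.5]
metric_scores = [0.82, 0.55, 0.40, 0.90, 0.35]  # e.g., ChatGPT-as-evaluator scores

rho, _ = spearmanr(human_ratings, metric_scores)
tau, _ = kendalltau(human_ratings, metric_scores)
print(f"Spearman rho = {rho:.2f}, Kendall tau = {tau:.2f}")
```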
arXiv Detail & Related papers (2023-03-07T16:57:20Z)
- Can ChatGPT Understand Too? A Comparative Study on ChatGPT and Fine-tuned BERT [103.57103957631067]
ChatGPT has attracted great attention, as it can generate fluent and high-quality responses to human inquiries.
We evaluate ChatGPT's understanding ability on the popular GLUE benchmark and compare it with 4 representative fine-tuned BERT-style models.
We find that: 1) ChatGPT falls short in handling paraphrase and similarity tasks; 2) ChatGPT outperforms all BERT models on inference tasks by a large margin; 3) ChatGPT achieves performance comparable to BERT on sentiment analysis and question answering tasks.
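Benchmarking against GLUE as described above amounts to scoring predictions task by task; a minimal sketch on GLUE's SST-2 sentiment task using the Hugging Face datasets library (the predict function is a placeholder for whichever model, ChatGPT or a fine-tuned BERT, is being compared):

```python
from datasets import load_dataset

# SST-2 is GLUE's binary sentiment task: label 0 = negative, 1 = positive.
sst2 = load_dataset("glue", "sst2", split="validation")

def predict(sentence: str) -> int:
    """Placeholder: query ChatGPT or a fine-tuned BERT model here."""
    return 1  # dummy prediction

correct = sum(predict(ex["sentence"]) == ex["label"] for ex in sst2)
print(f"accuracy = {correct / len(sst2):.3f}")
```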
arXiv Detail & Related papers (2023-02-19T12:29:33Z)
- Is ChatGPT a General-Purpose Natural Language Processing Task Solver? [113.22611481694825]
Large language models (LLMs) have demonstrated the ability to perform a variety of natural language processing (NLP) tasks zero-shot.
Recently, the debut of ChatGPT has drawn a great deal of attention from the NLP community.
It is not yet known whether ChatGPT can serve as a generalist model that can perform many NLP tasks zero-shot.
arXiv Detail & Related papers (2023-02-08T09:44:51Z)