ErAConD : Error Annotated Conversational Dialog Dataset for Grammatical
Error Correction
- URL: http://arxiv.org/abs/2112.08466v1
- Date: Wed, 15 Dec 2021 20:27:40 GMT
- Authors: Xun Yuan, Derek Pham, Sam Davidson, Zhou Yu
- Abstract summary: We present a novel parallel grammatical error correction (GEC) dataset drawn from open-domain conversations.
This dataset is, to our knowledge, the first GEC dataset targeted to a conversational setting.
To demonstrate the utility of the dataset, we use our annotated data to fine-tune a state-of-the-art GEC model.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Currently available grammatical error correction (GEC) datasets are compiled
using well-formed written text, limiting the applicability of these datasets to
other domains such as informal writing and dialog. In this paper, we present a
novel parallel GEC dataset drawn from open-domain chatbot conversations; this
dataset is, to our knowledge, the first GEC dataset targeted to a
conversational setting. To demonstrate the utility of the dataset, we use our
annotated data to fine-tune a state-of-the-art GEC model, resulting in a 16
point increase in model precision. This is of particular importance for GEC,
where precision is considered more important than recall because false
positives can seriously confuse language learners. We
also present a detailed annotation scheme which ranks errors by perceived
impact on comprehensibility, making our dataset both reproducible and
extensible. Experimental results show the effectiveness of our data in
improving GEC model performance in a conversational scenario.
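The precision/recall distinction emphasized in the abstract can be sketched with edit-level scoring. This is a simplified, hypothetical stand-in for ERRANT-style evaluation; the tuple representation of edits and the function name are illustrative assumptions, not the paper's exact scorer:

```python
def gec_precision_recall(predicted_edits, gold_edits):
    """Compute edit-level precision and recall for a GEC system.

    Edits are represented as (start, end, replacement) tuples over the
    source tokens; exact-match counting is a simplification of
    ERRANT-style scoring.
    """
    pred, gold = set(predicted_edits), set(gold_edits)
    tp = len(pred & gold)  # true positives: edits matching the annotation
    precision = tp / len(pred) if pred else 1.0
    recall = tp / len(gold) if gold else 1.0
    return precision, recall

# A spurious edit at (5, 6) is a false positive and lowers precision
# without affecting recall: p = 0.5, r = 1.0
pred = {(0, 1, "I"), (5, 6, "went")}
gold = {(0, 1, "I")}
p, r = gec_precision_recall(pred, gold)
```

Under this view, every false positive is a "correction" a learner would see that the annotators did not endorse, which is why GEC evaluation typically favors precision-weighted metrics.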
Related papers
- Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation
We show that Generative Error Correction models struggle to generalize beyond the specific types of errors encountered during training.
We propose DARAG, a novel approach designed to improve GEC for ASR in both in-domain (ID) and out-of-domain (OOD) scenarios.
Our approach is simple, scalable, and both domain- and language-agnostic.
arXiv Detail & Related papers (2024-10-17T04:00:29Z)
- ChatLang-8: An LLM-Based Synthetic Data Generation Framework for Grammatical Error Correction
We introduce a new dataset for grammatical error correction tasks, named ChatLang-8.
ChatLang-8 consists of 1 million pairs featuring human-like grammatical errors.
We observe improved model performance when using ChatLang-8 instead of existing GEC datasets.
arXiv Detail & Related papers (2024-06-05T12:35:00Z)
- Towards End-to-End Spoken Grammatical Error Correction
Spoken grammatical error correction (GEC) aims to supply feedback to L2 learners on their use of grammar when speaking.
This paper introduces an alternative "end-to-end" approach to spoken GEC, exploiting a speech recognition foundation model, Whisper.
arXiv Detail & Related papers (2023-11-09T17:49:02Z)
- Advancements in Arabic Grammatical Error Detection and Correction: An Empirical Investigation
Grammatical error correction (GEC) is a well-explored problem in English.
Research on GEC in morphologically rich languages has been limited due to challenges such as data scarcity and language complexity.
We present the first results on Arabic GEC using two newly developed Transformer-based pretrained sequence-to-sequence models.
arXiv Detail & Related papers (2023-05-24T05:12:58Z)
- A Unified Strategy for Multilingual Grammatical Error Correction with Pre-trained Cross-Lingual Language Model
We propose a generic and language-independent strategy for multilingual Grammatical Error Correction.
Our approach creates diverse parallel GEC data without any language-specific operations.
It achieves state-of-the-art results on the NLPCC 2018 Task 2 dataset (Chinese) and obtains competitive performance on Falko-Merlin (German) and RULEC-GEC (Russian).
arXiv Detail & Related papers (2022-01-26T02:10:32Z)
- A Syntax-Guided Grammatical Error Correction Model with Dependency Tree Correction
Grammatical Error Correction (GEC) is a task of detecting and correcting grammatical errors in sentences.
We propose a syntax-guided GEC model (SG-GEC) which adopts the graph attention mechanism to utilize the syntactic knowledge of dependency trees.
We evaluate our model on public GEC benchmarks, and it achieves competitive results.
arXiv Detail & Related papers (2021-11-05T07:07:48Z)
- Neural Quality Estimation with Multiple Hypotheses for Grammatical Error Correction
Grammatical Error Correction (GEC) aims to correct writing errors and help language learners improve their writing skills.
Existing GEC models tend to produce spurious corrections or fail to detect many errors.
This paper presents the Neural Verification Network (VERNet) for GEC quality estimation with multiple hypotheses.
arXiv Detail & Related papers (2021-05-10T15:04:25Z)
- A Self-Refinement Strategy for Noise Reduction in Grammatical Error Correction
Existing approaches for grammatical error correction (GEC) rely on supervised learning with manually created GEC datasets.
These datasets contain a non-negligible amount of "noise" in which errors were inappropriately edited or left uncorrected.
We propose a self-refinement method where the key idea is to denoise these datasets by leveraging the prediction consistency of existing models.
arXiv Detail & Related papers (2020-10-07T04:45:09Z)
- Data Weighted Training Strategies for Grammatical Error Correction
We show how to incorporate delta-log-perplexity, a type of example scoring, into a training schedule for Grammatical Error Correction (GEC).
Models trained on scored data achieve state-of-the-art results on common GEC test sets.
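The idea behind delta-log-perplexity scoring can be sketched as follows. The function names, the logistic weighting, and the `temperature` parameter are illustrative assumptions for this summary, not the paper's exact formulation:

```python
import math

def delta_log_perplexity(nll_base: float, nll_finetuned: float) -> float:
    """Score a training example by the change in its negative log-likelihood
    (log-perplexity) between a base model and a model fine-tuned on trusted
    data. A positive score means the trusted-data model assigns the example
    higher likelihood, suggesting it is a cleaner example."""
    return nll_base - nll_finetuned

def example_weight(score: float, temperature: float = 1.0) -> float:
    """Map a score to a non-negative training weight via a logistic curve
    (one plausible weighting scheme; hard filtering by score is another)."""
    return 1.0 / (1.0 + math.exp(-score / temperature))

# An example whose log-perplexity drops under the trusted model
# (score > 0) receives a weight above 0.5; a noisier example (score < 0)
# is down-weighted.
w_clean = example_weight(delta_log_perplexity(3.0, 2.0))   # score = +1.0
w_noisy = example_weight(delta_log_perplexity(2.0, 3.0))   # score = -1.0
```

Weighting rather than discarding lets the schedule keep all data while letting cleaner examples dominate the gradient.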
arXiv Detail & Related papers (2020-08-07T03:30:14Z)
- Towards Minimal Supervision BERT-based Grammar Error Correction
We incorporate contextual information from a pre-trained language model to better leverage annotated data and benefit multilingual scenarios.
Results show the strong potential of Bidirectional Encoder Representations from Transformers (BERT) in the grammatical error correction task.
arXiv Detail & Related papers (2020-01-10T15:45:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.