A Self-Refinement Strategy for Noise Reduction in Grammatical Error Correction
- URL: http://arxiv.org/abs/2010.03155v1
- Date: Wed, 7 Oct 2020 04:45:09 GMT
- Title: A Self-Refinement Strategy for Noise Reduction in Grammatical Error Correction
- Authors: Masato Mita, Shun Kiyono, Masahiro Kaneko, Jun Suzuki and Kentaro Inui
- Abstract summary: Existing approaches for grammatical error correction (GEC) rely on supervised learning with manually created GEC datasets.
There is a non-negligible amount of "noise" where errors were inappropriately edited or left uncorrected.
We propose a self-refinement method where the key idea is to denoise these datasets by leveraging the prediction consistency of existing models.
- Score: 54.569707226277735
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing approaches for grammatical error correction (GEC) largely rely on
supervised learning with manually created GEC datasets. However, there has been
little focus on verifying and ensuring the quality of the datasets, and on how
lower-quality data might affect GEC performance. We indeed found that there is
a non-negligible amount of "noise" where errors were inappropriately edited or
left uncorrected. To address this, we designed a self-refinement method where
the key idea is to denoise these datasets by leveraging the prediction
consistency of existing models, and outperformed strong denoising baseline
methods. We further applied task-specific techniques and achieved
state-of-the-art performance on the CoNLL-2014, JFLEG, and BEA-2019 benchmarks.
We then analyzed the effect of the proposed denoising method, and found that
our approach leads to improved coverage of corrections and facilitated fluency
edits which are reflected in higher recall and overall performance.
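The denoising idea in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not the authors' implementation: `self_refine`, `toy_correct`, and `toy_score` are hypothetical, and the acceptance rule (replace the annotated target only when a scorer prefers the model's prediction) is one plausible reading of "leveraging the prediction consistency of existing models". In practice the corrector would be a trained GEC model and the scorer might be a language-model fluency measure.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, annotated target)


def self_refine(
    pairs: List[Pair],
    correct: Callable[[str], str],
    score: Callable[[str], float],
) -> List[Pair]:
    """Return a denoised copy of `pairs` (hypothetical helper).

    The existing model's prediction replaces the annotated target only
    when the scorer rates the prediction strictly higher; otherwise the
    original target is kept unchanged.
    """
    refined = []
    for source, target in pairs:
        prediction = correct(source)
        if prediction != target and score(prediction) > score(target):
            refined.append((source, prediction))
        else:
            refined.append((source, target))
    return refined


# Toy stand-ins (assumptions): a lookup-table "model" and a scorer that
# penalizes known-bad word sequences. Neither comes from the paper.
def toy_correct(sentence: str) -> str:
    fixes = {"I has a pen .": "I have a pen ."}
    return fixes.get(sentence, sentence)


def toy_score(sentence: str) -> float:
    bad_sequences = ("I has ", "She like ")
    return -float(sum(sentence.count(b) for b in bad_sequences))


pairs = [
    ("I has a pen .", "I has a pen ."),       # noisy: error left uncorrected
    ("She like cats .", "She likes cats ."),  # clean: annotation is correct
]
refined = self_refine(pairs, toy_correct, toy_score)
for source, target in refined:
    print(f"{source} -> {target}")
```

In this toy run the first target is replaced by the model's correction, while the second is kept because the model's unchanged output scores lower than the existing annotation. The comparison against the original target acts as a guard, so a weak prediction cannot overwrite a good annotation.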
Related papers
- LM-Combiner: A Contextual Rewriting Model for Chinese Grammatical Error Correction [49.0746090186582]
Over-correction is a critical problem in the Chinese grammatical error correction (CGEC) task.
Recent work using model ensemble methods can effectively mitigate over-correction and improve the precision of the GEC system.
We propose the LM-Combiner, a rewriting model that can directly modify the over-correction of GEC system outputs without a model ensemble.
arXiv Detail & Related papers (2024-03-26T06:12:21Z)
- Learning with Noisy Foundation Models [95.50968225050012]
This paper is the first work to comprehensively understand and analyze the nature of noise in pre-training datasets.
We propose a tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise and improve generalization.
arXiv Detail & Related papers (2024-03-11T16:22:41Z)
- Fine tuning Pre trained Models for Robustness Under Noisy Labels [34.68018860186995]
The presence of noisy labels in a training dataset can significantly impact the performance of machine learning models.
We introduce a novel algorithm called TURN, which robustly and efficiently transfers the prior knowledge of pre-trained models.
arXiv Detail & Related papers (2023-10-24T20:28:59Z)
- PREM: A Simple Yet Effective Approach for Node-Level Graph Anomaly Detection [65.24854366973794]
Node-level graph anomaly detection (GAD) plays a critical role in identifying anomalous nodes from graph-structured data in domains such as medicine, social networks, and e-commerce.
We introduce a simple method termed PREprocessing and Matching (PREM for short) to improve the efficiency of GAD.
Our approach streamlines GAD, reducing time and memory consumption while maintaining powerful anomaly detection capabilities.
arXiv Detail & Related papers (2023-10-18T02:59:57Z)
- Contrastive Error Attribution for Finetuned Language Models [35.80256755393739]
Noisy and misannotated data is a core cause of hallucinations and unfaithful outputs in Natural Language Generation (NLG) tasks.
We introduce a framework to identify and remove low-quality training instances that lead to undesirable outputs.
We show that existing approaches for error tracing, such as gradient-based influence measures, do not perform reliably for detecting faithfulness errors.
arXiv Detail & Related papers (2022-12-21T02:28:07Z)
- Continual Learning For On-Device Environmental Sound Classification [63.81276321857279]
We propose a simple and efficient continual learning method for on-device environmental sound classification.
Our method selects the historical data for the training by measuring the per-sample classification uncertainty.
arXiv Detail & Related papers (2022-07-15T12:13:04Z)
- Dataset Condensation with Contrastive Signals [41.195453119305746]
Gradient matching-based dataset synthesis (DC) methods can achieve state-of-the-art performance when applied to data-efficient learning tasks.
In this study, we prove that the existing DC methods can perform worse than the random selection method when task-irrelevant information forms a significant part of the training dataset.
We propose dataset condensation with Contrastive signals (DCC) by modifying the loss function to enable the DC methods to effectively capture the differences between classes.
arXiv Detail & Related papers (2022-02-07T03:05:32Z)
- DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks [88.62288327934499]
We propose a novel augmentation method with language models trained on the linearized labeled sentences.
Our method is applicable to both supervised and semi-supervised settings.
arXiv Detail & Related papers (2020-11-03T07:49:15Z)
- Data Weighted Training Strategies for Grammatical Error Correction [8.370770440898454]
We show how to incorporate delta-log-perplexity, a type of example scoring, into a training schedule for Grammatical Error Correction (GEC).
Models trained on scored data achieve state-of-the-art results on common GEC test sets.
arXiv Detail & Related papers (2020-08-07T03:30:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.