MixEdit: Revisiting Data Augmentation and Beyond for Grammatical Error
Correction
- URL: http://arxiv.org/abs/2310.11671v1
- Date: Wed, 18 Oct 2023 02:45:51 GMT
- Authors: Jingheng Ye, Yinghui Li, Yangning Li, Hai-Tao Zheng
- Abstract summary: We propose MixEdit, a data augmentation approach that strategically and dynamically augments realistic data, without requiring extra monolingual corpora.
The results show that MixEdit substantially improves GEC models and is complementary to traditional data augmentation methods.
- Score: 24.370610646959907
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data Augmentation through generating pseudo data has been proven effective in
mitigating the challenge of data scarcity in the field of Grammatical Error
Correction (GEC). Various augmentation strategies have been widely explored,
most of which are motivated by two heuristics, i.e., increasing the
distribution similarity and diversity of pseudo data. However, the underlying
mechanism responsible for the effectiveness of these strategies remains poorly
understood. In this paper, we aim to clarify how data augmentation improves GEC
models. To this end, we introduce two interpretable and computationally
efficient measures: Affinity and Diversity. Our findings indicate that an
excellent GEC data augmentation strategy characterized by high Affinity and
appropriate Diversity can better improve the performance of GEC models. Based
on this observation, we propose MixEdit, a data augmentation approach that
strategically and dynamically augments realistic data, without requiring extra
monolingual corpora. To verify the correctness of our findings and the
effectiveness of the proposed MixEdit, we conduct experiments on mainstream
English and Chinese GEC datasets. The results show that MixEdit substantially
improves GEC models and is complementary to traditional data augmentation
methods.
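The abstract names Affinity and Diversity but gives no formulas for them. The sketch below is an illustration only, under two stated assumptions: Affinity is taken as the inverse of the KL divergence between the edit-pattern distributions of real and pseudo data, and Diversity as the entropy of the pseudo-data edit-pattern distribution. The function names, the `"source->target"` pattern strings, and the smoothing constants are all hypothetical and are not taken from the paper's code.

```python
import math
from collections import Counter


def smoothed_distribution(patterns, vocab, alpha=1.0):
    """Add-alpha smoothed distribution over a shared pattern vocabulary (assumed scheme)."""
    counts = Counter(patterns)
    total = len(patterns) + alpha * len(vocab)
    return {v: (counts[v] + alpha) / total for v in vocab}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined on the same keys."""
    return sum(pv * math.log(pv / q[k]) for k, pv in p.items())


def affinity(real_patterns, pseudo_patterns, eps=1e-9):
    """Assumed definition: inverse KL divergence between real and pseudo edit patterns."""
    vocab = sorted(set(real_patterns) | set(pseudo_patterns))
    p = smoothed_distribution(real_patterns, vocab)
    q = smoothed_distribution(pseudo_patterns, vocab)
    # eps only guards against division by zero when the distributions coincide.
    return 1.0 / (kl_divergence(p, q) + eps)


def diversity(pseudo_patterns):
    """Assumed definition: entropy (in nats) of the pseudo-data edit-pattern distribution."""
    counts = Counter(pseudo_patterns)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Toy edit patterns written as "source->target" strings (hypothetical data).
real = ["a->the", "a->the", "go->went", "is->are"]
pseudo_close = ["a->the", "go->went", "is->are", "a->the"]
pseudo_far = ["x->y", "x->y", "x->y", "x->y"]

print(affinity(real, pseudo_close) > affinity(real, pseudo_far))
print(diversity(pseudo_close), diversity(pseudo_far))
```

Under these assumptions, pseudo data whose edit patterns match the real data scores higher on Affinity, and pseudo data that repeats a single pattern has zero Diversity.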
Related papers
- Grammatical Error Correction via Mixed-Grained Weighted Training [68.94921674855621]
Grammatical Error Correction (GEC) aims to automatically correct grammatical errors in natural texts.
MainGEC designs token-level and sentence-level training weights based on inherent discrepancies in accuracy and potential diversity of data annotation.
(2023-11-23T08:34:37Z)
- Incorporating Supervised Domain Generalization into Data Augmentation [4.14360329494344]
We propose a method, contrastive semantic alignment (CSA) loss, to improve robustness and training efficiency of data augmentation.
Experiments on the CIFAR-100 and CUB datasets show that the proposed method improves the robustness and training efficiency of typical data augmentations.
(2023-10-02T09:20:12Z)
- Deep Generative Modeling-based Data Augmentation with Demonstration
using the BFBT Benchmark Void Fraction Datasets [3.341975883864341]
This paper explores the applications of deep generative models (DGMs) that have been widely used for image data generation to scientific data augmentation.
Once trained, DGMs can be used to generate synthetic data that are similar to the training data and significantly expand the dataset size.
(2023-08-19T22:19:41Z)
- Implicit Counterfactual Data Augmentation for Robust Learning [24.795542869249154]
This study proposes an Implicit Counterfactual Data Augmentation method to remove spurious correlations and make stable predictions.
Experiments have been conducted across various biased learning scenarios covering both image and text datasets.
(2023-04-26T10:36:40Z)
- AugGPT: Leveraging ChatGPT for Text Data Augmentation [59.76140039943385]
We propose a text data augmentation approach based on ChatGPT (named AugGPT).
AugGPT rephrases each sentence in the training samples into multiple conceptually similar but semantically different samples.
Experiment results on few-shot learning text classification tasks show the superior performance of the proposed AugGPT approach.
(2023-02-25T06:58:16Z)
- Augmentation-Aware Self-Supervision for Data-Efficient GAN Training [68.81471633374393]
Training generative adversarial networks (GANs) with limited data is challenging because the discriminator is prone to overfitting.
We propose a novel augmentation-aware self-supervised discriminator that predicts the augmentation parameter of the augmented data.
We compare our method with state-of-the-art (SOTA) methods using the class-conditional BigGAN and unconditional StyleGAN2 architectures.
(2022-05-31T10:35:55Z)
- Learning Representational Invariances for Data-Efficient Action
Recognition [52.23716087656834]
We show that our data augmentation strategy leads to promising performance on the Kinetics-100, UCF-101, and HMDB-51 datasets.
We also validate our data augmentation strategy in the fully supervised setting and demonstrate improved performance.
(2021-03-30T17:59:49Z)
- CoDA: Contrast-enhanced and Diversity-promoting Data Augmentation for
Natural Language Understanding [67.61357003974153]
We propose a novel data augmentation framework dubbed CoDA.
CoDA synthesizes diverse and informative augmented examples by integrating multiple transformations organically.
A contrastive regularization objective is introduced to capture the global relationship among all the data samples.
(2020-10-16T23:57:03Z)
- A Self-Refinement Strategy for Noise Reduction in Grammatical Error
Correction [54.569707226277735]
Existing approaches for grammatical error correction (GEC) rely on supervised learning with manually created GEC datasets.
There is a non-negligible amount of "noise" where errors were inappropriately edited or left uncorrected.
We propose a self-refinement method where the key idea is to denoise these datasets by leveraging the prediction consistency of existing models.
(2020-10-07T04:45:09Z)
- WeMix: How to Better Utilize Data Augmentation [36.07712244423405]
We develop a comprehensive analysis that reveals pros and cons of data augmentation.
The main limitation of data augmentation arises from the data bias.
We develop two novel algorithms, termed "AugDrop" and "MixLoss", to correct the data bias in the data augmentation.
(2020-10-03T03:12:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.