Mask the Correct Tokens: An Embarrassingly Simple Approach for Error
Correction
- URL: http://arxiv.org/abs/2211.13252v1
- Date: Wed, 23 Nov 2022 19:05:48 GMT
- Title: Mask the Correct Tokens: An Embarrassingly Simple Approach for Error
Correction
- Authors: Kai Shen, Yichong Leng, Xu Tan, Siliang Tang, Yuan Zhang, Wenjie Liu,
Edward Lin
- Abstract summary: Previous error correction methods usually take the source (incorrect) sentence as encoder input and generate the target (correct) sentence through the decoder.
We propose a simple yet effective masking strategy to achieve this goal.
- Score: 38.463639262607174
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text error correction aims to correct the errors in text sequences such as
those typed by humans or generated by speech recognition models. Previous error
correction methods usually take the source (incorrect) sentence as encoder
input and generate the target (correct) sentence through the decoder. Since the
error rate of the incorrect sentence is usually low (e.g., 10%), the
correction model learns to correct only a small number of error tokens while
merely copying the majority of (correct) tokens, which harms the effective
training of error correction. In this paper, we argue that the correct tokens
should be better utilized to facilitate effective training and then propose a
simple yet effective masking strategy to achieve this goal. Specifically, we
randomly mask out a part of the correct tokens in the source sentence and let
the model learn to not only correct the original error tokens but also predict
the masked tokens based on their context information. Our method enjoys several
advantages: 1) it alleviates trivial copy; 2) it leverages effective training
signals from correct tokens; 3) it is a plug-and-play module and can be applied
to different models and tasks. Experiments on spelling error correction and
speech recognition error correction on Mandarin datasets and grammar error
correction on English datasets with both autoregressive and non-autoregressive
generation models show that our method improves the correction accuracy
consistently.
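To make the masking strategy concrete, below is a minimal sketch of how it could be applied to a training pair, assuming the source (incorrect) and target (correct) sentences are already token-aligned with equal length; the [MASK] symbol, the mask_correct_tokens helper, and the mask_prob value are illustrative assumptions, not details taken from the paper.

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder mask symbol; the real token depends on the tokenizer


def mask_correct_tokens(source_tokens, target_tokens, mask_prob=0.15, seed=None):
    """Randomly replace a fraction of the *correct* source tokens with a mask symbol.

    A minimal sketch of the masking idea, assuming source and target are
    token-aligned and of equal length; the paper's method covers the general
    case, and mask_prob is an illustrative hyperparameter.
    """
    rng = random.Random(seed)
    masked_source = []
    for src_tok, tgt_tok in zip(source_tokens, target_tokens):
        is_correct = (src_tok == tgt_tok)      # error tokens are always left visible
        if is_correct and rng.random() < mask_prob:
            masked_source.append(MASK_TOKEN)   # model must predict this token from context
        else:
            masked_source.append(src_tok)
    # The training target is unchanged: the model both corrects the error tokens
    # and recovers the masked (correct) tokens.
    return masked_source, target_tokens


# Example: only correct tokens may be masked; the error ("weather" -> "whether") stays visible.
src = ["i", "wonder", "weather", "it", "will", "rain"]
tgt = ["i", "wonder", "whether", "it", "will", "rain"]
print(mask_correct_tokens(src, tgt, mask_prob=0.3, seed=0))
```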
Related papers
- Subtle Errors Matter: Preference Learning via Error-injected Self-editing [59.405145971637204]
We propose a novel preference learning framework called eRror-Injected Self-Editing (RISE).
RISE injects predefined subtle errors into partial tokens of correct solutions to construct hard pairs for error mitigation.
Experiments validate the effectiveness of RISE, with preference learning on Qwen2-7B-Instruct yielding notable improvements of 3.0% on GSM8K and 7.9% on MATH.
arXiv Detail & Related papers (2024-10-09T07:43:38Z)
- Alirector: Alignment-Enhanced Chinese Grammatical Error Corrector [25.450566841158864]
Chinese grammatical error correction (CGEC) faces serious overcorrection challenges when employing autoregressive generative models.
We propose an alignment-enhanced corrector for the overcorrection problem.
Experimental results on three CGEC datasets demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2024-02-07T05:56:54Z)
- An Error-Guided Correction Model for Chinese Spelling Error Correction [13.56600372085612]
We propose an error-guided correction model (EGCM) to improve Chinese spelling correction.
Our model achieves superior performance against state-of-the-art approaches by a remarkable margin.
arXiv Detail & Related papers (2023-01-16T09:27:45Z)
- SoftCorrect: Error Correction with Soft Detection for Automatic Speech Recognition [116.31926128970585]
We propose SoftCorrect with a soft error detection mechanism to avoid the limitations of both explicit and implicit error detection.
Compared with implicit error detection with CTC loss, SoftCorrect provides explicit signal about which words are incorrect.
Experiments on AISHELL-1 and Aidatatang datasets show that SoftCorrect achieves 26.1% and 9.4% CER reduction respectively.
arXiv Detail & Related papers (2022-12-02T09:11:32Z)
- From Spelling to Grammar: A New Framework for Chinese Grammatical Error Correction [12.170714706174314]
Chinese Grammatical Error Correction (CGEC) aims to generate a correct sentence from an erroneous sequence.
This paper divides the CGEC task into two steps, namely spelling error correction and grammatical error correction.
We propose a novel zero-shot approach for spelling error correction, which is simple but effective.
To handle grammatical error correction, we design part-of-speech features and semantic class features to enhance the neural network model.
arXiv Detail & Related papers (2022-11-03T07:30:09Z)
- FastCorrect: Fast Error Correction with Edit Alignment for Automatic Speech Recognition [90.34177266618143]
We propose FastCorrect, a novel NAR error correction model based on edit alignment.
FastCorrect speeds up the inference by 6-9 times and maintains the accuracy (8-14% WER reduction) compared with the autoregressive correction model.
It outperforms the accuracy of popular NAR models adopted in neural machine translation by a large margin.
arXiv Detail & Related papers (2021-05-09T05:35:36Z)
- On the Robustness of Language Encoders against Grammatical Errors [66.05648604987479]
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data.
Results confirm that the performance of all tested models is affected but the degree of impact varies.
arXiv Detail & Related papers (2020-05-12T11:01:44Z)
- ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators [108.3381301768299]
Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens.
We propose a more sample-efficient pre-training task called replaced token detection.
arXiv Detail & Related papers (2020-03-23T21:17:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.