Spelling Error Correction with Soft-Masked BERT
- URL: http://arxiv.org/abs/2005.07421v1
- Date: Fri, 15 May 2020 09:02:38 GMT
- Title: Spelling Error Correction with Soft-Masked BERT
- Authors: Shaohua Zhang, Haoran Huang, Jicong Liu and Hang Li
- Abstract summary: A state-of-the-art method for the task selects a character from a list of candidates for correction at each position of the sentence on the basis of BERT.
The accuracy of the method can be sub-optimal because BERT does not have sufficient capability to detect whether there is an error at each position.
We propose a novel neural architecture to address the aforementioned issue, which consists of a network for error detection and a network for error correction based on BERT.
- Score: 11.122964733563117
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spelling error correction is an important yet challenging task because a satisfactory solution to it essentially requires human-level language understanding ability. Without loss of generality, we consider Chinese spelling error correction (CSC) in this paper. A state-of-the-art method for the task selects a character from a list of candidates for correction (including non-correction) at each position of the sentence on the basis of BERT, the language representation model. The accuracy of the method can be sub-optimal, however, because BERT does not have sufficient capability to detect whether there is an error at each position, apparently owing to the way it is pre-trained with masked language modeling. In this work, we propose a novel neural architecture to address this issue, consisting of a network for error detection and a network for error correction based on BERT, with the former connected to the latter by what we call the soft-masking technique. Our method of using 'Soft-Masked BERT' is general, and it may be employed in other language detection-correction problems. Experimental results on two datasets demonstrate that the performance of our proposed method is significantly better than that of the baselines, including the one based solely on BERT.
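The architecture described above has a natural reading: a detection network estimates, for every position, the probability that the character is erroneous, and that probability softly mixes the position's embedding with the [MASK] embedding before a BERT-based correction network predicts the correct character. Below is a minimal PyTorch sketch of that idea; the hyperparameters, the detector size, and the use of nn.TransformerEncoder as a stand-in for a pretrained BERT encoder are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn


class SoftMaskedCorrector(nn.Module):
    """Toy Soft-Masked BERT-style model: detector + soft masking + corrector."""

    def __init__(self, vocab_size: int, hidden: int = 256, mask_token_id: int = 103,
                 num_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.mask_token_id = mask_token_id
        # Detection network: a bidirectional GRU predicting, per position,
        # the probability that the character there is an error.
        self.detector = nn.GRU(hidden, hidden // 2, batch_first=True, bidirectional=True)
        self.det_head = nn.Linear(hidden, 1)
        # Correction network: a Transformer encoder standing in for BERT; it
        # consumes the soft-masked embeddings and predicts the correct character.
        enc_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.corrector = nn.TransformerEncoder(enc_layer, num_layers=num_layers)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, input_ids: torch.Tensor):
        e = self.embed(input_ids)                       # (B, T, H) input embeddings
        h, _ = self.detector(e)
        p_err = torch.sigmoid(self.det_head(h))         # (B, T, 1) error probabilities
        # Soft masking: interpolate between the [MASK] embedding and the original
        # embedding according to the detected error probability.
        e_mask = self.embed(torch.full_like(input_ids, self.mask_token_id))
        e_soft = p_err * e_mask + (1.0 - p_err) * e
        # Residual connection from the input embeddings before the output layer.
        h_corr = self.corrector(e_soft) + e
        logits = self.out(h_corr)                       # (B, T, vocab) correction scores
        return p_err.squeeze(-1), logits


# Forward pass on random ids; vocab size 21128 mirrors Chinese BERT but is arbitrary here.
model = SoftMaskedCorrector(vocab_size=21128)
p_err, logits = model(torch.randint(0, 21128, (2, 16)))
```

During training the paper combines a detection loss on the per-position error probabilities with a correction loss on the character predictions; the sketch omits the loss terms and shows only the forward pass.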
Related papers
- A Coin Has Two Sides: A Novel Detector-Corrector Framework for Chinese Spelling Correction [79.52464132360618]
Chinese Spelling Correction (CSC) stands as a foundational Natural Language Processing (NLP) task.
We introduce a novel approach based on an error detector-corrector framework.
Our detector is designed to yield two error detection results, each characterized by high precision and recall.
arXiv Detail & Related papers (2024-09-06T09:26:45Z)
- A Comprehensive Approach to Misspelling Correction with BERT and Levenshtein Distance [1.7000578646860536]
Spelling mistakes, among the most prevalent writing errors, are frequently encountered due to various factors.
This research aims to identify and rectify diverse spelling errors in text using neural networks.
arXiv Detail & Related papers (2024-07-24T16:07:11Z)
- Chinese Spelling Correction as Rephrasing Language Model [63.65217759957206]
We study Chinese Spelling Correction (CSC), which aims to detect and correct the potential spelling errors in a given sentence.
Current state-of-the-art methods regard CSC as a sequence tagging task and fine-tune BERT-based models on sentence pairs.
We propose Rephrasing Language Model (ReLM), where the model is trained to rephrase the entire sentence by infilling additional slots, instead of character-to-character tagging.
arXiv Detail & Related papers (2023-08-17T06:04:28Z)
- Improving Pre-trained Language Models with Syntactic Dependency Prediction Task for Chinese Semantic Error Recognition [52.55136323341319]
Existing Chinese text error detection mainly focuses on spelling and simple grammatical errors.
Chinese semantic errors are understudied and more complex, such that humans cannot easily recognize them.
arXiv Detail & Related papers (2022-04-15T13:55:32Z)
- Exploring the Capacity of a Large-scale Masked Language Model to Recognize Grammatical Errors [3.55517579369797]
We show that 5 to 10% of the training data is enough for a BERT-based error detection method to achieve performance equivalent to that of a non-language-model-based method.
Using pseudo error data, we also show that the method exhibits these desirable properties when learning rules for recognizing various types of errors.
arXiv Detail & Related papers (2021-08-27T10:37:14Z)
- Tail-to-Tail Non-Autoregressive Sequence Prediction for Chinese Grammatical Error Correction [49.25830718574892]
We present a new framework named Tail-to-Tail (TtT) non-autoregressive sequence prediction.
Most tokens are correct and can be conveyed directly from source to target, while the error positions can be estimated and corrected.
Experimental results on standard datasets, especially on the variable-length datasets, demonstrate the effectiveness of TtT in terms of sentence-level Accuracy, Precision, Recall, and F1-Measure.
arXiv Detail & Related papers (2021-06-03T05:56:57Z)
- Misspelling Correction with Pre-trained Contextual Language Model [0.0]
We present two experiments, based on BERT and the edit distance algorithm, for ranking and selecting candidate corrections.
The results of our experiments demonstrate that, when combined properly, the contextual word embeddings of BERT and edit distance can effectively correct spelling errors (a minimal sketch of this combination follows this entry).
arXiv Detail & Related papers (2021-01-08T20:11:01Z)
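The entry above describes combining BERT with the edit distance algorithm to rank and select candidate corrections. The following self-contained Python sketch illustrates that general idea; the candidate list and its scores are hard-coded placeholders standing in for the output of a BERT fill-mask prediction, and the weighting scheme (parameter alpha) is an illustrative assumption rather than the paper's setting.

```python
# Minimal sketch: rank correction candidates for a misspelled token by combining
# a (placeholder) masked-LM probability with Levenshtein edit distance.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def rank_candidates(misspelled: str, mlm_candidates: dict[str, float],
                    alpha: float = 0.5) -> list[tuple[str, float]]:
    """Score candidates by alpha * LM score minus (1 - alpha) * normalized edit distance."""
    scored = []
    for cand, lm_score in mlm_candidates.items():
        dist = levenshtein(misspelled, cand) / max(len(misspelled), len(cand))
        scored.append((cand, alpha * lm_score - (1 - alpha) * dist))
    return sorted(scored, key=lambda x: x[1], reverse=True)


# Placeholder candidates and scores; in practice these would come from a BERT
# fill-mask prediction over the masked position of the misspelled word.
candidates = {"receive": 0.61, "believe": 0.22, "retrieve": 0.09}
print(rank_candidates("recieve", candidates))
```

Collapsing the two signals into a single weighted score is only one plausible way to combine them; the entry itself reports two separate ranking experiments based on BERT and edit distance.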
- Improving the Efficiency of Grammatical Error Correction with Erroneous Span Detection and Correction [106.63733511672721]
We propose a novel language-independent approach to improve the efficiency of Grammatical Error Correction (GEC) by dividing the task into two subtasks: Erroneous Span Detection (ESD) and Erroneous Span Correction (ESC).
ESD identifies grammatically incorrect text spans with an efficient sequence tagging model. ESC leverages a seq2seq model to take the sentence with annotated erroneous spans as input and only outputs the corrected text for these spans.
Experiments show our approach performs comparably to conventional seq2seq approaches on both English and Chinese GEC benchmarks with less than 50% of the inference time cost.
arXiv Detail & Related papers (2020-10-07T08:29:11Z)
- On the Robustness of Language Encoders against Grammatical Errors [66.05648604987479]
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data.
Results confirm that the performance of all tested models is affected but the degree of impact varies.
arXiv Detail & Related papers (2020-05-12T11:01:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.