uChecker: Masked Pretrained Language Models as Unsupervised Chinese
Spelling Checkers
- URL: http://arxiv.org/abs/2209.07068v1
- Date: Thu, 15 Sep 2022 05:57:12 GMT
- Authors: Piji Li
- Abstract summary: We propose a framework named uChecker to conduct unsupervised spelling error detection and correction.
Masked pretrained language models such as BERT are introduced as the backbone model.
Benefiting from various and flexible masking operations, we propose a Confusionset-guided masking strategy to fine-train the masked language model.
- Score: 23.343006562849126
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of Chinese Spelling Check (CSC) aims to detect and correct
spelling errors in text. Manually annotating a high-quality dataset is
expensive and time-consuming, so the training dataset is usually very small
(e.g., SIGHAN15 contains only 2339 training samples); supervised-learning
based models therefore suffer from data sparsity and over-fitting, especially
in the era of big language models. In this paper, we investigate
the \textbf{unsupervised} paradigm to address the CSC problem and we propose a
framework named \textbf{uChecker} to conduct unsupervised spelling error
detection and correction. Masked pretrained language models such as BERT are
introduced as the backbone model considering their powerful language diagnosis
capability. Benefiting from various and flexible masking operations, we
propose a Confusionset-guided masking strategy to fine-train the masked
language model, further improving the performance of unsupervised detection and
correction. Experimental results on standard datasets demonstrate the
effectiveness of our proposed model uChecker in terms of character-level and
sentence-level Accuracy, Precision, Recall, and F1-Measure on tasks of spelling
error detection and correction respectively.
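As a rough illustration of the unsupervised detection-and-correction idea above: mask a confusable character, ask a masked LM for candidates, and substitute only within that character's confusion set. In this sketch the confusion set, the `toy_mlm` scoring table, and the threshold are all hypothetical stand-ins, not the paper's actual components; a real system would query BERT's fill-mask head instead.

```python
# Hypothetical sketch of confusionset-guided unsupervised spelling check.
# The confusion set and toy_mlm below are illustrative stand-ins.

CONFUSION_SET = {
    "她": {"他", "它"},  # phonologically/visually confusable characters
    "在": {"再"},
    "再": {"在"},
}
ALL_CONFUSABLE = set(CONFUSION_SET) | {c for s in CONFUSION_SET.values() for c in s}

def toy_mlm(sentence, mask_pos):
    """Stand-in for a masked LM: return a probability distribution over
    candidate characters for the masked position. A real system would
    call BERT's fill-mask head here."""
    table = {
        ("我", "见"): {"再": 0.7, "在": 0.2, "们": 0.1},
    }
    left = sentence[mask_pos - 1] if mask_pos > 0 else ""
    right = sentence[mask_pos + 1] if mask_pos + 1 < len(sentence) else ""
    # Default: the model is certain the original character is correct.
    return table.get((left, right), {sentence[mask_pos]: 1.0})

def ucheck(sentence, threshold=0.3):
    """Mask each confusable position; if a candidate from the original
    character's confusion set scores clearly higher, flag and correct it."""
    corrected = list(sentence)
    for i, ch in enumerate(sentence):
        if ch not in ALL_CONFUSABLE:
            continue
        probs = toy_mlm(sentence, i)
        best = max(probs, key=probs.get)
        # Substitute only within the confusion set of the original char.
        if (best != ch and best in CONFUSION_SET.get(ch, set())
                and probs[best] - probs.get(ch, 0.0) > threshold):
            corrected[i] = best
    return "".join(corrected)
```

The confusion set restricts candidates to plausible misspellings, which is what keeps an otherwise free-running masked LM from "correcting" characters that are merely unusual rather than wrong.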
Related papers
- Detection-Correction Structure via General Language Model for Grammatical Error Correction [22.609760120265587]
This paper introduces an integrated detection-correction structure, named DeCoGLM, based on the General Language Model (GLM).
The detection phase employs a fault-tolerant detection template, while the correction phase leverages autoregressive mask infilling for localized error correction.
Our model demonstrates competitive performance against the state-of-the-art models on English and Chinese GEC datasets.
arXiv Detail & Related papers (2024-05-28T04:04:40Z) - Contextual Spelling Correction with Language Model for Low-resource Setting [0.0]
A small-scale word-based transformer LM is trained to provide the SC model with contextual understanding.
The probability of an error occurring (the error model) is estimated from the corpus.
The LM and the error model are combined into the SC model through the well-known noisy-channel framework.
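The noisy-channel combination described above can be sketched as follows. The language-model and error-model probabilities here are made-up toy values (the real ones come from the trained LM and corpus error counts); `correct` simply maximizes log P(candidate) + log P(observed | candidate).

```python
import math

# Toy sketch of the noisy-channel decision rule:
#   correction = argmax_c  P(c) * P(observed | c)
# Both tables below are illustrative stand-ins, not real estimates.

LM = {  # P(candidate | context), from a small contextual LM
    "there": 0.6, "their": 0.3, "three": 0.1,
}
ERROR_MODEL = {  # P(typed | intended), estimated from corpus error counts
    ("ther", "there"): 0.2,
    ("ther", "their"): 0.1,
    ("ther", "three"): 0.01,
}

def correct(observed, candidates):
    """Pick the candidate maximizing log P(c) + log P(observed | c)."""
    def score(c):
        return (math.log(LM.get(c, 1e-9))
                + math.log(ERROR_MODEL.get((observed, c), 1e-9)))
    return max(candidates, key=score)
```

Working in log-space avoids underflow when the two probabilities are multiplied, and the small floor (1e-9) stands in for proper smoothing of unseen events.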
arXiv Detail & Related papers (2024-04-28T05:29:35Z) - Chinese Spelling Correction as Rephrasing Language Model [63.65217759957206]
We study Chinese Spelling Correction (CSC), which aims to detect and correct the potential spelling errors in a given sentence.
Current state-of-the-art methods regard CSC as a sequence tagging task and fine-tune BERT-based models on sentence pairs.
We propose Rephrasing Language Model (ReLM), where the model is trained to rephrase the entire sentence by infilling additional slots, instead of character-to-character tagging.
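A minimal sketch of how such a rephrasing-style training pair might be constructed, assuming one mask slot per output character; the slot layout and label convention here are illustrative assumptions, not ReLM's exact scheme.

```python
# Hypothetical construction of a rephrasing training pair: the model sees
# the (possibly erroneous) source followed by one mask slot per character,
# and learns to infill the slots with the corrected sentence.

MASK = "[MASK]"

def build_rephrasing_pair(source, target):
    """Input = source + one mask slot per target character;
    labels = the corrected sentence, with loss computed only on the slots."""
    assert len(source) == len(target)  # CSC keeps sentence length fixed
    model_input = list(source) + [MASK] * len(target)
    labels = [None] * len(source) + list(target)  # None = no loss
    return model_input, labels
```

Contrast with sequence tagging: here the model rewrites the whole sentence in the slots instead of emitting one substitution decision per input character.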
arXiv Detail & Related papers (2023-08-17T06:04:28Z) - Unsupervised Calibration through Prior Adaptation for Text
Classification using Large Language Models [37.39843935632105]
We propose an approach to adapt the prior class distribution to perform text classification tasks without the need for labelled samples.
Results show that these methods outperform the un-adapted model for different numbers of training shots in the prompt.
arXiv Detail & Related papers (2023-07-13T12:11:36Z) - Rethinking Masked Language Modeling for Chinese Spelling Correction [70.85829000570203]
We study Chinese Spelling Correction (CSC) as a joint decision made by two separate models: a language model and an error model.
We find that fine-tuning BERT tends to over-fit the error model while under-fitting the language model, resulting in poor generalization to out-of-distribution error patterns.
We demonstrate that a very simple strategy, randomly masking 20% of the non-error tokens in the input sequence during fine-tuning, is sufficient for learning a much better language model without sacrificing the error model.
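The masking strategy in that summary can be sketched as a fine-tuning data-preparation step; the `[MASK]` token string and the helper name are assumptions for illustration.

```python
import random

MASK = "[MASK]"

def mask_non_error_tokens(tokens, error_positions, rate=0.2, seed=0):
    """Randomly replace `rate` of the NON-error tokens with [MASK] during
    fine-tuning; error tokens are always left visible, so the error model
    still sees them while the language model gets harder infilling work."""
    rng = random.Random(seed)
    out = list(tokens)
    candidates = [i for i in range(len(tokens)) if i not in error_positions]
    k = round(len(candidates) * rate)
    for i in rng.sample(candidates, k):
        out[i] = MASK
    return out
```

Because only clean tokens are ever masked, the model cannot simply copy its input at those positions and is forced to rely on context, which is the point of the strategy.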
arXiv Detail & Related papers (2023-05-28T13:19:12Z) - Error-Robust Retrieval for Chinese Spelling Check [43.56073620728942]
Chinese Spelling Check (CSC) aims to detect and correct error tokens in Chinese contexts.
Previous methods may not fully leverage the existing datasets.
We introduce our plug-and-play retrieval method with error-robust information for Chinese Spelling Check.
arXiv Detail & Related papers (2022-11-15T01:55:34Z) - Bridging the Gap Between Training and Inference of Bayesian Controllable
Language Models [58.990214815032495]
Large-scale pre-trained language models have achieved great success on natural language generation tasks.
Bayesian controllable language models (BCLMs) have been shown to be efficient in controllable language generation.
We propose a "Gemini Discriminator" for controllable language generation which alleviates the mismatch problem with a small computational cost.
arXiv Detail & Related papers (2022-06-11T12:52:32Z) - Improving Pre-trained Language Models with Syntactic Dependency
Prediction Task for Chinese Semantic Error Recognition [52.55136323341319]
Existing Chinese text error detection mainly focuses on spelling and simple grammatical errors.
Chinese semantic errors are understudied and so complex that even humans cannot easily recognize them.
arXiv Detail & Related papers (2022-04-15T13:55:32Z) - NLI Data Sanity Check: Assessing the Effect of Data Corruption on Model
Performance [3.7024660695776066]
We propose a new diagnostic test suite that makes it possible to assess whether a dataset constitutes a good testbed for evaluating models' meaning-understanding capabilities.
We specifically apply controlled corruption transformations to widely used benchmarks (MNLI and ANLI).
A large decrease in model accuracy indicates that the original dataset provides a proper challenge to the models' reasoning capabilities.
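One possible controlled corruption of the kind described above: shuffling the hypothesis word order preserves lexical overlap but destroys meaning, so a model whose accuracy barely drops is likely not using sentence semantics. This transform is illustrative; the paper's exact corruptions may differ.

```python
import random

def shuffle_hypothesis_words(premise, hypothesis, seed=0):
    """Corrupt an NLI pair by permuting the hypothesis word order.
    Lexical content is unchanged; compositional meaning is destroyed."""
    rng = random.Random(seed)
    words = hypothesis.split()
    rng.shuffle(words)
    return premise, " ".join(words)
```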
arXiv Detail & Related papers (2021-04-10T12:28:07Z) - On the Robustness of Language Encoders against Grammatical Errors [66.05648604987479]
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data.
Results confirm that the performance of all tested models is affected but the degree of impact varies.
arXiv Detail & Related papers (2020-05-12T11:01:44Z) - Towards Minimal Supervision BERT-based Grammar Error Correction [81.90356787324481]
We try to incorporate contextual information from a pre-trained language model to leverage annotations and benefit multilingual scenarios.
Results show the strong potential of Bidirectional Encoder Representations from Transformers (BERT) in the grammatical error correction task.
arXiv Detail & Related papers (2020-01-10T15:45:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.