Error-Robust Retrieval for Chinese Spelling Check
- URL: http://arxiv.org/abs/2211.07843v2
- Date: Sun, 25 Feb 2024 22:17:52 GMT
- Title: Error-Robust Retrieval for Chinese Spelling Check
- Authors: Xunjian Yin and Xinyu Hu and Jin Jiang and Xiaojun Wan
- Abstract summary: Chinese Spelling Check (CSC) aims to detect and correct error tokens in Chinese contexts.
Previous methods may not fully leverage the existing datasets.
We introduce our plug-and-play retrieval method with error-robust information for Chinese Spelling Check.
- Score: 43.56073620728942
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Chinese Spelling Check (CSC) aims to detect and correct error tokens in
Chinese contexts, and has a wide range of applications. However, it faces two
challenges: annotated data is insufficient, and previous methods may not fully
leverage the existing datasets. In
this paper, we introduce our plug-and-play retrieval method with error-robust
information for Chinese Spelling Check (RERIC), which can be directly applied
to existing CSC models. The datastore for retrieval is built entirely from
the training data, with a design tailored to the characteristics
of CSC. Specifically, we employ multimodal representations that fuse phonetic,
morphological, and contextual information when computing the query and key
during retrieval, to enhance robustness against potential errors. Furthermore,
to better judge the retrieved candidates, the n-gram surrounding the
token to be checked is regarded as the value and used for
reranking. Experimental results on the SIGHAN benchmarks demonstrate that our
proposed method achieves substantial improvements over existing work.
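The retrieval idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (fuse, cosine, context_overlap, retrieve), the use of concatenation and cosine similarity, the weight alpha, and the toy vectors are all assumptions made for illustration. It only shows the general shape: keys fuse phonetic, glyph, and contextual features; values are n-grams from corrected training text; candidates are reranked by how well the surrounding characters agree.

```python
import math

def fuse(phonetic, glyph, context):
    """Concatenate three feature vectors into one key/query vector (assumed fusion)."""
    return list(phonetic) + list(glyph) + list(context)

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def context_overlap(query_ngram, stored_ngram):
    """Share of non-center positions where the two n-grams agree."""
    mid = len(query_ngram) // 2
    pairs = [(q, s) for i, (q, s) in enumerate(zip(query_ngram, stored_ngram)) if i != mid]
    return sum(q == s for q, s in pairs) / len(pairs) if pairs else 0.0

def retrieve(query_vec, query_ngram, datastore, k=2, alpha=0.5):
    """Take the k nearest keys, rerank by n-gram context, return the center character."""
    nearest = sorted(datastore, key=lambda e: cosine(query_vec, e[0]), reverse=True)[:k]
    scored = [
        (alpha * cosine(query_vec, key) + (1 - alpha) * context_overlap(query_ngram, ngram), ngram)
        for key, ngram in nearest
    ]
    best_ngram = max(scored)[1]
    return best_ngram[len(best_ngram) // 2]

# Hypothetical datastore: (fused key, n-gram value taken from corrected text).
datastore = [
    (fuse([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]), "天气好"),
    (fuse([1.0, 0.0], [0.8, 0.2], [0.1, 0.9]), "坐汽车"),
    (fuse([0.0, 1.0], [0.1, 0.9], [0.9, 0.1]), "吃饭了"),
]

# Query for the misspelled center character in "天汽好" (汽 typed instead of 气).
query_vec = fuse([1.0, 0.0], [0.9, 0.1], [0.5, 0.5])
print(retrieve(query_vec, "天汽好", datastore))  # 气
```

In this toy case the two phonetically similar candidates are both retrieved, and the surrounding characters decide between them, which is the role the abstract assigns to the n-gram value.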
Related papers
- EdaCSC: Two Easy Data Augmentation Methods for Chinese Spelling Correction [0.0]
Chinese Spelling Correction (CSC) aims to detect and correct spelling errors in Chinese sentences caused by phonetic or visual similarities.
We propose two data augmentation methods to address these limitations.
Firstly, we augment the dataset by either splitting long sentences into shorter ones or reducing typos in sentences with multiple typos.
arXiv Detail & Related papers (2024-09-08T14:29:10Z)
- C-LLM: Learn to Check Chinese Spelling Errors Character by Character [61.53865964535705]
We propose C-LLM, a Large Language Model-based Chinese Spell Checking method that learns to check errors Character by Character.
C-LLM achieves an average improvement of 10% over existing methods.
arXiv Detail & Related papers (2024-06-24T11:16:31Z)
- Understanding and Mitigating Classification Errors Through Interpretable Token Patterns [58.91023283103762]
Characterizing errors in easily interpretable terms gives insight into whether a classifier is prone to making systematic errors.
We propose to discover those patterns of tokens that distinguish correct and erroneous predictions.
We show that our method, Premise, performs well in practice.
arXiv Detail & Related papers (2023-11-18T00:24:26Z)
- Chinese Spelling Correction as Rephrasing Language Model [63.65217759957206]
We study Chinese Spelling Correction (CSC), which aims to detect and correct the potential spelling errors in a given sentence.
Current state-of-the-art methods regard CSC as a sequence tagging task and fine-tune BERT-based models on sentence pairs.
We propose Rephrasing Language Model (ReLM), where the model is trained to rephrase the entire sentence by infilling additional slots, instead of character-to-character tagging.
arXiv Detail & Related papers (2023-08-17T06:04:28Z)
- Block the Label and Noise: An N-Gram Masked Speller for Chinese Spell Checking [0.0]
This paper proposes an n-gram masking layer that masks current and/or surrounding tokens to avoid label leakage and error disturbance.
Experiments on SIGHAN datasets have demonstrated that the pluggable n-gram masking mechanism can improve the performance of prevalent CSC models.
arXiv Detail & Related papers (2023-05-05T06:43:56Z)
- CSCD-NS: a Chinese Spelling Check Dataset for Native Speakers [62.61866477815883]
We present CSCD-NS, the first Chinese spelling check dataset designed for native speakers.
CSCD-NS is ten times larger in scale and exhibits a distinct error distribution.
We propose a novel method that simulates the input process through an input method.
arXiv Detail & Related papers (2022-11-16T09:25:42Z)
- uChecker: Masked Pretrained Language Models as Unsupervised Chinese Spelling Checkers [23.343006562849126]
We propose a framework named uChecker to conduct unsupervised spelling error detection and correction.
Masked pretrained language models such as BERT are introduced as the backbone model.
Benefiting from the various and flexible MASKing operations, we propose a Confusionset-guided masking strategy to fine-train the masked language model.
arXiv Detail & Related papers (2022-09-15T05:57:12Z)
- Improving Pre-trained Language Models with Syntactic Dependency Prediction Task for Chinese Semantic Error Recognition [52.55136323341319]
Existing Chinese text error detection mainly focuses on spelling and simple grammatical errors.
Chinese semantic errors are understudied and so complex that humans cannot easily recognize them.
arXiv Detail & Related papers (2022-04-15T13:55:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.