Read, Listen, and See: Leveraging Multimodal Information Helps Chinese
Spell Checking
- URL: http://arxiv.org/abs/2105.12306v1
- Date: Wed, 26 May 2021 02:38:11 GMT
- Title: Read, Listen, and See: Leveraging Multimodal Information Helps Chinese
Spell Checking
- Authors: Heng-Da Xu, Zhongli Li, Qingyu Zhou, Chao Li, Zizhen Wang, Yunbo Cao,
Heyan Huang and Xian-Ling Mao
- Abstract summary: We propose a Chinese spell checker called ReaLiSe, by directly leveraging the multimodal information of the Chinese characters.
The ReaLiSe model tackles the CSC task by (1) capturing the semantic, phonetic and graphic information of the input characters, and (2) selectively mixing the information in these modalities to predict the correct output.
Experiments on the SIGHAN benchmarks show that the proposed model outperforms strong baselines by a large margin.
- Score: 20.74049189959078
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Chinese Spell Checking (CSC) aims to detect and correct erroneous characters
for user-generated text in the Chinese language. Most of the Chinese spelling
errors are misused semantically, phonetically or graphically similar
characters. Previous attempts noticed this phenomenon and tried to use the
similarity for this task. However, these methods use either heuristics or
handcrafted confusion sets to predict the correct character. In this paper, we
propose a Chinese spell checker called ReaLiSe, by directly leveraging the
multimodal information of the Chinese characters. The ReaLiSe model tackles the
CSC task by (1) capturing the semantic, phonetic and graphic information of the
input characters, and (2) selectively mixing the information in these
modalities to predict the correct output. Experiments on the SIGHAN benchmarks
show that the proposed model outperforms strong baselines by a large margin.
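
The "selective mixing" step described in the abstract can be illustrated with a small gated-fusion sketch. This is not the authors' implementation: the abstract does not specify the mechanism, so the scalar sigmoid gate per modality, the function name `gated_fusion`, and the `gate_weights` layout are all assumptions made for illustration only.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(semantic, phonetic, graphic, gate_weights):
    """Mix three same-length modality vectors for a single character.

    Hypothetical scheme (an assumption, not taken from the paper): each
    modality gets a scalar gate sigmoid(w . context), where context is the
    concatenation of all three vectors, and the fused vector is the
    gate-weighted sum of the modality vectors.
    """
    modalities = {"semantic": semantic, "phonetic": phonetic, "graphic": graphic}
    context = list(semantic) + list(phonetic) + list(graphic)
    fused = [0.0] * len(semantic)
    gates = {}
    for name, vec in modalities.items():
        weights = gate_weights[name]
        gate = sigmoid(sum(w * c for w, c in zip(weights, context)))
        gates[name] = gate
        for k, value in enumerate(vec):
            fused[k] += gate * value
    return fused, gates


# With all-zero gate weights every gate is 0.5, so the result is half the sum.
zero_weights = {name: [0.0] * 6 for name in ("semantic", "phonetic", "graphic")}
fused, gates = gated_fusion([1.0, 2.0], [3.0, 4.0], [5.0, 6.0], zero_weights)
```

In a trained model the gate weights would be learned, so that a character whose error is phonetic could draw more on the phonetic vector than on the graphic one; here they are fixed inputs to keep the sketch self-contained.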
Related papers
- C-LLM: Learn to Check Chinese Spelling Errors Character by Character [61.53865964535705]
We propose C-LLM, a Large Language Model-based Chinese Spell Checking method that learns to check errors Character by Character.
C-LLM achieves an average improvement of 10% over existing methods.
arXiv Detail & Related papers (2024-06-24T11:16:31Z)
- Chinese Text Recognition with A Pre-Trained CLIP-Like Model Through Image-IDS Aligning [61.34060587461462]
We propose a two-stage framework for Chinese Text Recognition (CTR).
We pre-train a CLIP-like model by aligning printed character images and Ideographic Description Sequences (IDS).
This pre-training stage simulates humans recognizing Chinese characters and obtains the canonical representation of each character.
The learned representations are employed to supervise the CTR model, such that traditional single-character recognition can be improved to text-line recognition.
arXiv Detail & Related papers (2023-09-03T05:33:16Z)
- Chinese Spelling Correction as Rephrasing Language Model [63.65217759957206]
We study Chinese Spelling Correction (CSC), which aims to detect and correct the potential spelling errors in a given sentence.
Current state-of-the-art methods regard CSC as a sequence tagging task and fine-tune BERT-based models on sentence pairs.
We propose Rephrasing Language Model (ReLM), where the model is trained to rephrase the entire sentence by infilling additional slots, instead of character-to-character tagging.
arXiv Detail & Related papers (2023-08-17T06:04:28Z)
- Disentangled Phonetic Representation for Chinese Spelling Correction [25.674770525359236]
Chinese Spelling Correction aims to detect and correct erroneous characters in Chinese texts.
Efforts have been made to introduce phonetic information in this task, but they typically merge phonetic representations with character representations.
We propose to disentangle the two types of features to allow for direct interaction between textual and phonetic information.
arXiv Detail & Related papers (2023-05-24T06:39:12Z)
- A Chinese Spelling Check Framework Based on Reverse Contrastive Learning [4.60495447017298]
We present a novel framework for Chinese spelling checking, which consists of three modules: language representation, spelling check and reverse contrastive learning.
Specifically, we propose a reverse contrastive learning strategy, which explicitly forces the model to minimize the agreement between the similar examples.
Experimental results show that our framework is model-agnostic and could be combined with existing Chinese spelling check models to yield state-of-the-art performance.
arXiv Detail & Related papers (2022-10-25T08:05:38Z)
- Improving Chinese Spelling Check by Character Pronunciation Prediction: The Effects of Adaptivity and Granularity [76.20568599642799]
Chinese spelling check (CSC) is a fundamental NLP task that detects and corrects spelling errors in Chinese texts.
In this paper, we consider introducing an auxiliary task of Chinese pronunciation prediction (CPP) to improve CSC.
We propose SCOPE, which builds two parallel decoders on top of a shared encoder: one for the primary CSC task and the other for a fine-grained auxiliary CPP task.
arXiv Detail & Related papers (2022-10-20T03:42:35Z)
- ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information [32.70080326854314]
We propose ChineseBERT, which incorporates the glyph and pinyin information of Chinese characters into language model pretraining.
The proposed ChineseBERT model yields a significant performance boost over baseline models with fewer training steps.
arXiv Detail & Related papers (2021-06-30T13:06:00Z)
- SHUOWEN-JIEZI: Linguistically Informed Tokenizers For Chinese Language Model Pretraining [48.880840711568425]
We study the influences of three main factors on the Chinese tokenization for pretrained language models.
We propose three kinds of tokenizers: 1) SHUOWEN (meaning Talk Word), the pronunciation-based tokenizers; 2) JIEZI (meaning Solve Character), the glyph-based tokenizers.
We find that SHUOWEN and JIEZI tokenizers can generally outperform conventional single-character tokenizers.
arXiv Detail & Related papers (2021-06-01T11:20:02Z)
- 2kenize: Tying Subword Sequences for Chinese Script Conversion [54.33749520569979]
We propose a model that can disambiguate between mappings and convert between the two scripts (Simplified and Traditional Chinese).
Our proposed method outperforms previous Chinese Character conversion approaches by 6 points in accuracy.
arXiv Detail & Related papers (2020-05-07T10:53:05Z)
- SpellGCN: Incorporating Phonological and Visual Similarities into Language Models for Chinese Spelling Check [28.446849414110297]
Chinese Spelling Check (CSC) is a task to detect and correct spelling errors in Chinese natural language.
Existing methods have made attempts to incorporate the similarity knowledge between Chinese characters.
This paper proposes to incorporate phonological and visual similarity into language models for CSC via a specialized graph convolutional network (SpellGCN).
arXiv Detail & Related papers (2020-04-26T03:34:06Z)