Investigating Glyph Phonetic Information for Chinese Spell Checking:
What Works and What's Next
- URL: http://arxiv.org/abs/2212.04068v3
- Date: Sun, 21 May 2023 14:55:37 GMT
- Title: Investigating Glyph Phonetic Information for Chinese Spell Checking:
What Works and What's Next
- Authors: Xiaotian Zhang, Yanjun Zheng, Hang Yan, Xipeng Qiu
- Abstract summary: We discuss the role of glyph-phonetic information in the Chinese Spell Checking (CSC) task.
We propose a new, more challenging, and practical setting for testing the generalizability of CSC models.
- Score: 48.12125502456953
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While pre-trained Chinese language models have demonstrated impressive
performance on a wide range of NLP tasks, the Chinese Spell Checking (CSC) task
remains a challenge. Previous research has explored using information such as
glyphs and phonetics to improve the ability to distinguish misspelled
characters, with good results. However, the generalization ability of these
models is not well understood: it is unclear whether they incorporate
glyph-phonetic information and, if so, whether this information is fully
utilized. In this paper, we aim to better understand the role of glyph-phonetic
information in the CSC task and suggest directions for improvement.
Additionally, we propose a new, more challenging, and practical setting for
testing the generalizability of CSC models. All code is made publicly
available.
Related papers
- The Impact of Visual Information in Chinese Characters: Evaluating Large Models' Ability to Recognize and Utilize Radicals [17.24821720084663]
We evaluate Large Language Models' and Vision-Language Models' understanding of visual elements in Chinese characters.
Our results reveal that models surprisingly exhibit some, but still limited, knowledge of the visual information.
We observe consistent improvement in Part-Of-Speech tagging when providing additional information about radicals.
arXiv Detail & Related papers (2024-10-11T17:30:02Z) - C-LLM: Learn to Check Chinese Spelling Errors Character by Character [61.53865964535705]
We propose C-LLM, a Large Language Model-based Chinese Spell Checking method that learns to check errors Character by Character.
C-LLM achieves an average improvement of 10% over existing methods.
arXiv Detail & Related papers (2024-06-24T11:16:31Z) - Identifying and Analyzing Task-Encoding Tokens in Large Language Models [55.03191279766383]
In this paper, we identify and analyze task-encoding tokens on whose representations the task performance depends.
We show that template and stopword tokens are the most prone to be task-encoding.
Our work sheds light on how large language models (LLMs) learn to perform a task from demonstrations, deepens our understanding of the varied roles different types of tokens play in LLMs, and provides insights for avoiding instability from improperly utilizing task-encoding tokens.
arXiv Detail & Related papers (2024-01-20T20:55:21Z) - UniFine: A Unified and Fine-grained Approach for Zero-shot
Vision-Language Understanding [84.83494254263138]
We propose a unified framework to take advantage of the fine-grained information for zero-shot vision-language learning.
Our framework outperforms former zero-shot methods on VQA and achieves substantial improvement on SNLI-VE and VCR.
arXiv Detail & Related papers (2023-07-03T09:03:12Z) - Understanding Translationese in Cross-Lingual Summarization [106.69566000567598]
Cross-lingual summarization (MS) aims at generating a concise summary in a different target language.
To collect large-scale CLS data, existing datasets typically involve translation in their creation.
In this paper, we first confirm that different approaches of constructing CLS datasets will lead to different degrees of translationese.
arXiv Detail & Related papers (2022-12-14T13:41:49Z) - Error-Robust Retrieval for Chinese Spelling Check [43.56073620728942]
Chinese Spelling Check (CSC) aims to detect and correct error tokens in Chinese contexts.
Previous methods may not fully leverage the existing datasets.
We introduce our plug-and-play retrieval method with error-robust information for Chinese Spelling Check.
arXiv Detail & Related papers (2022-11-15T01:55:34Z) - Improving Chinese Spelling Check by Character Pronunciation Prediction:
The Effects of Adaptivity and Granularity [76.20568599642799]
Chinese spelling check (CSC) is a fundamental NLP task that detects and corrects spelling errors in Chinese texts.
In this paper, we consider introducing an auxiliary task of Chinese pronunciation prediction ( CPP) to improve CSC.
We propose SCOPE which builds on top of a shared encoder two parallel decoders, one for the primary CSC task and the other for a fine-grained auxiliary CPP task.
arXiv Detail & Related papers (2022-10-20T03:42:35Z) - Learning from the Dictionary: Heterogeneous Knowledge Guided Fine-tuning
for Chinese Spell Checking [32.16787396943434]
Chinese Spell Checking (CSC) aims to detect and correct Chinese spelling errors.
Recent researches start from the pretrained knowledge of language models and take multimodal information into CSC models to improve the performance.
We propose the LEAD framework, which renders the CSC model to learn heterogeneous knowledge from the dictionary in terms of phonetics, vision, and meaning.
arXiv Detail & Related papers (2022-10-19T06:31:34Z) - Contextual Similarity is More Valuable than Character Similarity:
Curriculum Learning for Chinese Spell Checking [26.93594761258908]
Chinese Spell Checking (CSC) task aims to detect and correct Chinese spelling errors.
To make better use of contextual similarity, we propose a simple yet effective curriculum learning framework for the CSC task.
With the help of our designed model-agnostic framework, existing CSC models will be trained from easy to difficult as humans learn Chinese characters.
arXiv Detail & Related papers (2022-07-17T03:12:27Z) - Read, Listen, and See: Leveraging Multimodal Information Helps Chinese
Spell Checking [20.74049189959078]
We propose a Chinese spell checker called ReaLiSe, by directly leveraging the multimodal information of the Chinese characters.
The ReaLiSe tackles model the CSC task by (1) capturing the semantic, phonetic and graphic information of the input characters, and (2) mixing the information in these modalities to predict the correct output.
Experiments on the SIGHAN benchmarks show that the proposed model outperforms strong baselines by a large margin.
arXiv Detail & Related papers (2021-05-26T02:38:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.