Related papers: C-LLM: Learn to Check Chinese Spelling Errors Character by Character

URL: http://arxiv.org/abs/2406.16536v2
Date: Sat, 26 Oct 2024 16:27:46 GMT
Title: C-LLM: Learn to Check Chinese Spelling Errors Character by Character
Authors: Kunting Li, Yong Hu, Liang He, Fandong Meng, Jie Zhou
Abstract summary: We propose C-LLM, a Large Language Model-based Chinese Spell Checking method that learns to check errors Character by Character. C-LLM achieves an average improvement of 10% over existing methods.
Score: 61.53865964535705
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Chinese Spell Checking (CSC) aims to detect and correct spelling errors in sentences. Although Large Language Models (LLMs) exhibit robust capabilities and are widely applied in various tasks, their performance on CSC is often unsatisfactory. We find that LLMs fail to meet the Chinese character-level constraints of the CSC task, namely equal length and phonetic similarity, leading to a performance bottleneck. Further analysis reveals that this issue stems from the granularity of tokenization, as current mixed character-word tokenization struggles to satisfy these character-level constraints. To address this issue, we propose C-LLM, a Large Language Model-based Chinese Spell Checking method that learns to check errors Character by Character. Character-level tokenization enables the model to learn character-level alignment, effectively mitigating issues related to character-level constraints. Furthermore, CSC is simplified to replication-dominated and substitution-supplemented tasks. Experiments on two CSC benchmarks demonstrate that C-LLM achieves an average improvement of 10% over existing methods. Specifically, it shows a 2.1% improvement in general scenarios and a significant 12% improvement in vertical domain scenarios, establishing state-of-the-art performance. The source code can be accessed at https://github.com/ktlKTL/C-LLM.
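
Illustration: the equal-length constraint and the "replication-dominated, substitution-supplemented" view described in the abstract can be sketched in a few lines of Python. This is only a minimal sketch and not the C-LLM implementation: the function names `char_tokenize` and `substitutions` and the example sentence are assumptions made here for illustration, and the phonetic-similarity constraint is not checked.

```python
def char_tokenize(sentence: str) -> list[str]:
    """Split a sentence into single characters (one token per character)."""
    return list(sentence)


def substitutions(source: str, corrected: str) -> list[tuple[int, str, str]]:
    """Return (position, source_char, corrected_char) for every changed character.

    Enforces the equal-length constraint: a correction must keep the
    same number of characters as the input sentence.
    """
    src, out = char_tokenize(source), char_tokenize(corrected)
    if len(src) != len(out):
        raise ValueError("equal-length constraint violated")
    return [(i, s, c) for i, (s, c) in enumerate(zip(src, out)) if s != c]


# Hypothetical example: 心 (xin) is a phonetically similar misspelling of 兴 (xing).
source = "我今天很高心"
corrected = "我今天很高兴"
edits = substitutions(source, corrected)
print(edits)  # [(5, '心', '兴')]

# Most characters are copied unchanged; only a few are substituted.
copied = len(source) - len(edits)
print(f"{copied} of {len(source)} characters replicated")
```

With one token per character, each output position lines up with exactly one input position, so checking the length and locating the substituted characters is a simple position-wise comparison.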

Related papers

Unveiling the Impact of Multimodal Features on Chinese Spelling Correction: From Analysis to Design [6.592255876792784]
The Chinese Spelling Correction (CSC) task focuses on detecting and correcting spelling errors in sentences. LLMs face limitations in CSC, particularly over-correction, making them suboptimal for this task. We introduce NamBert, a novel multimodal model for Chinese spelling correction.
arXiv Detail & Related papers (2025-04-10T11:19:09Z)
A Training-free LLM-based Approach to General Chinese Character Error Correction [31.511249971873962]
Chinese spelling correction (CSC) is a crucial task that aims to correct character errors in Chinese text. We introduce the task of General Chinese Character Error Correction (C2EC), which focuses on all three types of character errors. We extend the training-free prompt-free CSC method to C2EC by using Levenshtein distance for handling length changes and leveraging an additional prompt-based large language model (LLM) to improve performance.
arXiv Detail & Related papers (2025-02-21T07:48:54Z)
Enhancing LLM Character-Level Manipulation via Divide and Conquer [74.55804812450164]
Large Language Models (LLMs) have demonstrated strong generalization capabilities across a wide range of natural language processing (NLP) tasks. They exhibit notable weaknesses in character-level string manipulation, struggling with fundamental operations such as character deletion, insertion, and substitution. We propose Character-Level Manipulation via Divide and Conquer, a novel approach designed to bridge the gap between token-level processing and character-level manipulation.
arXiv Detail & Related papers (2025-02-12T07:37:39Z)
EdaCSC: Two Easy Data Augmentation Methods for Chinese Spelling Correction [0.0]
Chinese Spelling Correction (CSC) aims to detect and correct spelling errors in Chinese sentences caused by phonetic or visual similarities. We propose two data augmentation methods to address these limitations. Firstly, we augment the dataset by either splitting long sentences into shorter ones or reducing typos in sentences with multiple typos.
arXiv Detail & Related papers (2024-09-08T14:29:10Z)
Chinese Spelling Correction as Rephrasing Language Model [63.65217759957206]
We study Chinese Spelling Correction (CSC), which aims to detect and correct the potential spelling errors in a given sentence. Current state-of-the-art methods regard CSC as a sequence tagging task and fine-tune BERT-based models on sentence pairs. We propose Rephrasing Language Model (ReLM), where the model is trained to rephrase the entire sentence by infilling additional slots, instead of character-to-character tagging.
arXiv Detail & Related papers (2023-08-17T06:04:28Z)
TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation [53.974228542090046]
Contrastive Language-Image Pre-training (CLIP) has recently shown great promise in pixel-level zero-shot learning tasks. Existing approaches utilizing CLIP's text and patch embeddings to generate semantic masks often misidentify input pixels from unseen classes. We propose TagCLIP (Trusty-aware guided CLIP) to address this issue.
arXiv Detail & Related papers (2023-04-15T12:52:23Z)
CSCD-NS: a Chinese Spelling Check Dataset for Native Speakers [62.61866477815883]
We present CSCD-NS, the first Chinese spelling check dataset designed for native speakers. CSCD-NS is ten times larger in scale and exhibits a distinct error distribution. We propose a novel method that simulates the input process through an input method.
arXiv Detail & Related papers (2022-11-16T09:25:42Z)
Error-Robust Retrieval for Chinese Spelling Check [43.56073620728942]
Chinese Spelling Check (CSC) aims to detect and correct error tokens in Chinese contexts. Previous methods may not fully leverage the existing datasets. We introduce our plug-and-play retrieval method with error-robust information for Chinese Spelling Check.
arXiv Detail & Related papers (2022-11-15T01:55:34Z)
Improving Chinese Spelling Check by Character Pronunciation Prediction: The Effects of Adaptivity and Granularity [76.20568599642799]
Chinese spelling check (CSC) is a fundamental NLP task that detects and corrects spelling errors in Chinese texts. In this paper, we consider introducing an auxiliary task of Chinese pronunciation prediction (CPP) to improve CSC. We propose SCOPE, which builds two parallel decoders on top of a shared encoder, one for the primary CSC task and the other for a fine-grained auxiliary CPP task.
arXiv Detail & Related papers (2022-10-20T03:42:35Z)
Contextual Similarity is More Valuable than Character Similarity: Curriculum Learning for Chinese Spell Checking [26.93594761258908]
The Chinese Spell Checking (CSC) task aims to detect and correct Chinese spelling errors. To make better use of contextual similarity, we propose a simple yet effective curriculum learning framework for the CSC task. With the help of our model-agnostic framework, existing CSC models are trained from easy to difficult, as humans learn Chinese characters.
arXiv Detail & Related papers (2022-07-17T03:12:27Z)
Improving Pre-trained Language Models with Syntactic Dependency Prediction Task for Chinese Semantic Error Recognition [52.55136323341319]
Existing Chinese text error detection mainly focuses on spelling and simple grammatical errors. Chinese semantic errors are understudied and so complex that humans cannot easily recognize them.
arXiv Detail & Related papers (2022-04-15T13:55:32Z)
The Past Mistake is the Future Wisdom: Error-driven Contrastive Probability Optimization for Chinese Spell Checking [32.8563506271794]
Chinese Spell Checking (CSC) aims to detect and correct Chinese spelling errors. Pre-trained language models (PLMs) have advanced progress on the CSC task. We propose an Error-driven COntrastive Probability Optimization framework for the CSC task.
arXiv Detail & Related papers (2022-03-02T09:58:56Z)

This list is automatically generated from the titles and abstracts of the papers on this site.