CLFEC: A New Task for Unified Linguistic and Factual Error Correction in paragraph-level Chinese Professional Writing
- URL: http://arxiv.org/abs/2602.23845v1
- Date: Fri, 27 Feb 2026 09:36:05 GMT
- Title: CLFEC: A New Task for Unified Linguistic and Factual Error Correction in paragraph-level Chinese Professional Writing
- Authors: Jian Kai, Zidong Zhang, Jiwen Chen, Zhengxiang Wu, Songtao Sun, Fuyang Li, Yang Cao, Qiang Liu
- Abstract summary: In paragraph-level Chinese professional writing, linguistic (word/grammar/punctuation) and factual errors frequently co-occur and interact. This paper introduces CLFEC (Chinese Linguistic & Factual Error Correction), a new task for joint linguistic and factual correction. We construct a mixed, multi-domain Chinese professional writing dataset spanning current affairs, finance, law, and medicine.
- Score: 8.863678336953036
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Chinese text correction has traditionally focused on spelling and grammar, while factual error correction is usually treated separately. However, in paragraph-level Chinese professional writing, linguistic (word/grammar/punctuation) and factual errors frequently co-occur and interact, making unified correction both necessary and challenging. This paper introduces CLFEC (Chinese Linguistic & Factual Error Correction), a new task for joint linguistic and factual correction. We construct a mixed, multi-domain Chinese professional writing dataset spanning current affairs, finance, law, and medicine. We then conduct a systematic study of LLM-based correction paradigms, from prompting to retrieval-augmented generation (RAG) and agentic workflows. The analysis reveals practical challenges, including limited generalization of specialized correction models, the need for evidence grounding for factual repair, the difficulty of mixed-error paragraphs, and over-correction on clean inputs. Results further show that handling linguistic and factual errors within the same context outperforms decoupled processes, and that agentic workflows can be effective with suitable backbone models. Overall, our dataset and empirical findings provide guidance for building reliable, fully automatic proofreading systems in industrial settings.
Related papers
- TranslationCorrect: A Unified Framework for Machine Translation Post-Editing with Predictive Error Assistance [5.306276499628096]
Machine translation (MT) post-editing and research data collection often rely on inefficient, disconnected workflows. We introduce TranslationCorrect, an integrated framework designed to streamline these tasks. It combines MT generation using models like NLLB, automated error prediction using models like XCOMET or LLM APIs (providing detailed reasoning), and an intuitive post-editing interface within a single environment.
arXiv Detail & Related papers (2025-06-23T06:38:49Z) - Chain of Correction for Full-text Speech Recognition with Large Language Models [21.37485126269991]
Chain of Correction (CoC) is a multi-turn chat format to correct errors segment by segment, guided by pre-recognized text and full-text context for better semantic understanding. Experiments show that CoC significantly outperforms baseline and benchmark systems in correcting full-text ASR outputs.
arXiv Detail & Related papers (2025-04-02T09:06:23Z) - TGEA: An error-annotated dataset and benchmark tasks for text generation from pretrained language models [57.758735361535486]
TGEA is an error-annotated dataset for text generation from pretrained language models (PLMs). We create an error taxonomy to cover 24 types of errors occurring in PLM-generated sentences. This is the first dataset with comprehensive annotations for PLM-generated texts.
arXiv Detail & Related papers (2025-03-06T09:14:02Z) - Loss-Aware Curriculum Learning for Chinese Grammatical Error Correction [21.82403446634522]
Chinese grammatical error correction (CGEC) aims to detect and correct errors in the input Chinese sentences. Current approaches ignore that correction difficulty varies across different instances and treat these samples equally. We propose a multi-granularity Curriculum Learning framework to address this problem.
arXiv Detail & Related papers (2024-12-31T08:11:49Z) - Learning from Mistakes: Self-correct Adversarial Training for Chinese Unnatural Text Correction [6.426690600216749]
Unnatural text correction aims to automatically detect and correct spelling errors or adversarial perturbation errors in sentences. Existing methods rely on fine-tuning or adversarial training to correct errors. We propose a self-correct adversarial training framework for LearnIng from MIsTakes (LIMIT).
arXiv Detail & Related papers (2024-12-23T04:58:58Z) - Full-text Error Correction for Chinese Speech Recognition with Large Language Model [11.287933170894311]
Large Language Models (LLMs) have demonstrated substantial potential for error correction in Automatic Speech Recognition (ASR). This paper investigates the effectiveness of LLMs for error correction in full-text generated by ASR systems from longer speech recordings.
arXiv Detail & Related papers (2024-09-12T06:50:45Z) - Understanding and Mitigating Classification Errors Through Interpretable Token Patterns [58.91023283103762]
Characterizing errors in easily interpretable terms gives insight into whether a classifier is prone to making systematic errors.
We propose to discover those patterns of tokens that distinguish correct and erroneous predictions.
We show that our method, Premise, performs well in practice.
arXiv Detail & Related papers (2023-11-18T00:24:26Z) - Chinese Spelling Correction as Rephrasing Language Model [63.65217759957206]
We study Chinese Spelling Correction (CSC), which aims to detect and correct the potential spelling errors in a given sentence.
Current state-of-the-art methods regard CSC as a sequence tagging task and fine-tune BERT-based models on sentence pairs.
We propose Rephrasing Language Model (ReLM), where the model is trained to rephrase the entire sentence by infilling additional slots, instead of character-to-character tagging.
arXiv Detail & Related papers (2023-08-17T06:04:28Z) - Factual Error Correction for Abstractive Summaries Using Entity Retrieval [57.01193722520597]
We propose RFEC, an efficient factual error correction system based on an entity-retrieval post-editing process.
RFEC retrieves the evidence sentences from the original document by comparing the sentences with the target summary.
Next, RFEC detects the entity-level errors in the summaries by considering the evidence sentences and substitutes the wrong entities with the accurate entities from the evidence sentences.
arXiv Detail & Related papers (2022-04-18T11:35:02Z) - Improving Pre-trained Language Models with Syntactic Dependency Prediction Task for Chinese Semantic Error Recognition [52.55136323341319]
Existing Chinese text error detection mainly focuses on spelling and simple grammatical errors.
Chinese semantic errors are understudied and so complex that even humans cannot easily recognize them.
arXiv Detail & Related papers (2022-04-15T13:55:32Z) - On the Robustness of Language Encoders against Grammatical Errors [66.05648604987479]
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data.
Results confirm that the performance of all tested models is affected but the degree of impact varies.
arXiv Detail & Related papers (2020-05-12T11:01:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.