An Alignment-Agnostic Model for Chinese Text Error Correction
- URL: http://arxiv.org/abs/2104.07190v1
- Date: Thu, 15 Apr 2021 01:17:34 GMT
- Title: An Alignment-Agnostic Model for Chinese Text Error Correction
- Authors: Liying Zheng, Yue Deng, Weishun Song, Liang Xu, Jing Xiao
- Abstract summary: This paper investigates how to correct Chinese text errors of the mistaken, missing, and redundant character types.
Most existing models can correct mistaken-character errors, but they cannot deal with missing or redundant characters.
We propose a novel detect-correct framework that is alignment-agnostic, meaning it can handle both aligned and non-aligned text.
- Score: 17.429266115653007
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper investigates how to correct Chinese text errors of the
mistaken, missing, and redundant character types, which are common for native
Chinese speakers. Most existing models based on the detect-correct framework
can correct mistaken-character errors, but they cannot deal with missing or
redundant characters. The reason is that the lengths of sentences before and
after correction differ, leading to an inconsistency between model inputs and
outputs. Although Seq2Seq-based and sequence tagging methods address this
problem and achieve relatively good results in English contexts, they do not
perform well in Chinese contexts according to our experimental results. In our
work, we propose a novel detect-correct framework that is alignment-agnostic,
meaning it can handle both aligned and non-aligned text, and it can also serve
as a cold-start model when no annotated data are provided. Experimental
results on three datasets demonstrate that our method is effective and
achieves the best performance among existing published models.
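This listing does not include code, but a minimal sketch can make the alignment-agnostic idea concrete: a detector tags every input character with an edit operation (keep, replace, delete, insert-after), and a corrector applies the tags, so the output length may differ from the input length. Everything below (the tag set, `detect`, `correct`, and `predict_char`) is a hypothetical illustration, not the paper's actual architecture.
```python
# Minimal sketch of a detect-correct pipeline whose output length may differ
# from its input length. Illustrative only; the tag set and both "models" are
# hypothetical stand-ins, not the paper's design.
from typing import List, Tuple

# Edit operations the detector may assign to each input character.
KEEP, REPLACE, DELETE, INSERT_AFTER = "K", "R", "D", "I"

def detect(sentence: str) -> List[Tuple[str, str]]:
    """Tag every character with an edit operation.

    A real detector would be a trained sequence-labeling model; this
    placeholder simply returns KEEP for every character.
    """
    return [(ch, KEEP) for ch in sentence]

def predict_char(context: str, anchor: str) -> str:
    """Placeholder for a character-level corrector (e.g. a masked LM)."""
    return anchor

def correct(sentence: str, tags: List[Tuple[str, str]]) -> str:
    """Apply the detected edits; handles mistaken, redundant, and missing characters."""
    out = []
    for ch, op in tags:
        if op == KEEP:
            out.append(ch)
        elif op == REPLACE:
            out.append(predict_char(sentence, ch))   # substitute a mistaken character
        elif op == DELETE:
            continue                                  # drop a redundant character
        elif op == INSERT_AFTER:
            out.append(ch)
            out.append(predict_char(sentence, ch))   # add a missing character
    return "".join(out)

print(correct("今天天气很好", detect("今天天气很好")))
```
Because delete and insert-after change the sentence length, such a pipeline is not limited to length-preserving (aligned) corrections, which is the property the abstract calls alignment-agnostic.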
Related papers
- EdaCSC: Two Easy Data Augmentation Methods for Chinese Spelling Correction [0.0]
Chinese Spelling Correction (CSC) aims to detect and correct spelling errors in Chinese sentences caused by phonetic or visual similarities.
We propose two data augmentation methods to address these limitations.
Firstly, we augment the dataset by either splitting long sentences into shorter ones or reducing typos in sentences with multiple typos.
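As a rough, hypothetical illustration of the two augmentation ideas just described (splitting long sentences at punctuation, and thinning out typos in sentences that contain many of them), a sketch might look as follows; the length threshold, the per-sentence typo cap, and both function names are assumptions, not values from the EdaCSC paper.
```python
# Illustrative data augmentation for aligned CSC pairs (source, target of equal length).
import re
import random
from typing import List, Tuple

def split_long(src: str, tgt: str, max_len: int = 64) -> List[Tuple[str, str]]:
    """Split a long (source, target) pair at sentence-final punctuation.

    Only applies to equal-length pairs, so typo positions stay aligned.
    """
    if len(src) != len(tgt) or len(src) <= max_len:
        return [(src, tgt)]
    pieces, start = [], 0
    for m in re.finditer(r"[。！？，]", src):
        pieces.append((src[start:m.end()], tgt[start:m.end()]))
        start = m.end()
    if start < len(src):
        pieces.append((src[start:], tgt[start:]))
    return pieces

def reduce_typos(src: str, tgt: str, max_typos: int = 2) -> Tuple[str, str]:
    """Keep only a few typos in a heavily corrupted sentence and revert the rest."""
    typo_positions = [i for i, (a, b) in enumerate(zip(src, tgt)) if a != b]
    if len(typo_positions) <= max_typos:
        return src, tgt
    keep = set(random.sample(typo_positions, max_typos))
    new_src = "".join(src[i] if i in keep else tgt[i] for i in range(len(src)))
    return new_src, tgt
```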
arXiv Detail & Related papers (2024-09-08T14:29:10Z)
- Is it an i or an l: Test-time Adaptation of Text Line Recognition Models [9.149602257966917]
We introduce the problem of adapting text line recognition models during test time.
We propose an iterative self-training approach that uses feedback from the language model to update the optical model.
Experimental results show that the proposed adaptation method offers an absolute improvement of up to 8% in character error rate.
arXiv Detail & Related papers (2023-08-29T05:44:00Z)
- Chinese Spelling Correction as Rephrasing Language Model [63.65217759957206]
We study Chinese Spelling Correction (CSC), which aims to detect and correct the potential spelling errors in a given sentence.
Current state-of-the-art methods regard CSC as a sequence tagging task and fine-tune BERT-based models on sentence pairs.
We propose Rephrasing Language Model (ReLM), where the model is trained to rephrase the entire sentence by infilling additional slots, instead of character-to-character tagging.
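A hypothetical sketch of the infilling formulation described above: the corrupted sentence is concatenated with a run of mask slots, and the model is supervised to fill those slots with the corrected sentence rather than to tag each input character. The mask and ignore tokens below are placeholders, not the tokens ReLM actually uses.
```python
# Illustrative only: builds a rephrasing-style training example in which the
# model fills mask slots appended to the input instead of tagging characters
# one-to-one. MASK/IGNORE are placeholder tokens, not ReLM's real vocabulary.
from typing import List, Tuple

MASK = "[MASK]"
IGNORE = "[IGNORE]"

def build_rephrasing_example(src: str, tgt: str) -> Tuple[List[str], List[str]]:
    """Return (input tokens, label tokens) for a mask-infilling objective."""
    input_tokens = list(src) + [MASK] * len(tgt)
    # Supervise only the appended slots; source positions carry no loss.
    labels = [IGNORE] * len(src) + list(tgt)
    return input_tokens, labels

inp, lab = build_rephrasing_example("我喜换唱歌", "我喜欢唱歌")
print(inp)  # source characters followed by mask slots
print(lab)  # ignore markers followed by the corrected sentence
```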
arXiv Detail & Related papers (2023-08-17T06:04:28Z)
- Towards Fine-Grained Information: Identifying the Type and Location of Translation Errors [80.22825549235556]
Existing approaches cannot consider error position and type simultaneously.
We build an FG-TED model to predict addition and omission errors.
Experiments show that our model can identify both error type and position concurrently, and gives state-of-the-art results.
arXiv Detail & Related papers (2023-02-17T16:20:33Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- An Error-Guided Correction Model for Chinese Spelling Error Correction [13.56600372085612]
We propose an error-guided correction model (EGCM) to improve Chinese spelling correction.
Our model achieves superior performance against state-of-the-art approaches by a remarkable margin.
arXiv Detail & Related papers (2023-01-16T09:27:45Z)
- A Chinese Spelling Check Framework Based on Reverse Contrastive Learning [4.60495447017298]
We present a novel framework for Chinese spelling checking, which consists of three modules: language representation, spelling check and reverse contrastive learning.
Specifically, we propose a reverse contrastive learning strategy that explicitly forces the model to minimize the agreement between similar examples.
Experimental results show that our framework is model-agnostic and could be combined with existing Chinese spelling check models to yield state-of-the-art performance.
arXiv Detail & Related papers (2022-10-25T08:05:38Z)
- Factual Error Correction for Abstractive Summaries Using Entity Retrieval [57.01193722520597]
We propose RFEC, an efficient factual error correction system based on an entity-retrieval post-editing process.
RFEC retrieves the evidence sentences from the original document by comparing the sentences with the target summary.
Next, RFEC detects the entity-level errors in the summaries by considering the evidence sentences and substitutes the wrong entities with the accurate entities from the evidence sentences.
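The retrieve-then-substitute flow summarized above can be sketched roughly as follows; the word-overlap retriever, the string-based substitution, and the `extract_entities` callback are stand-ins for RFEC's learned components, introduced here only for illustration.
```python
# Rough sketch of an entity-retrieval post-editing flow. Placeholders throughout;
# the actual RFEC components are neural models, not string heuristics.
from typing import Callable, List

def retrieve_evidence(document: List[str], summary_sentence: str, k: int = 3) -> List[str]:
    """Rank document sentences by word overlap with the summary sentence."""
    def overlap(sent: str) -> int:
        return len(set(sent.split()) & set(summary_sentence.split()))
    return sorted(document, key=overlap, reverse=True)[:k]

def correct_entities(summary_sentence: str, evidence: List[str],
                     extract_entities: Callable[[str], List[str]]) -> str:
    """Replace summary entities absent from the evidence with the closest
    evidence entity (a crude stand-in for a learned substitution model)."""
    evidence_entities = {e for sent in evidence for e in extract_entities(sent)}
    corrected = summary_sentence
    for ent in extract_entities(summary_sentence):
        if ent not in evidence_entities and evidence_entities:
            replacement = max(evidence_entities,
                              key=lambda cand: len(set(cand) & set(ent)))
            corrected = corrected.replace(ent, replacement)
    return corrected
```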
arXiv Detail & Related papers (2022-04-18T11:35:02Z)
- Improving Pre-trained Language Models with Syntactic Dependency Prediction Task for Chinese Semantic Error Recognition [52.55136323341319]
Existing Chinese text error detection mainly focuses on spelling and simple grammatical errors.
Chinese semantic errors are understudied and so complex that humans cannot easily recognize them.
arXiv Detail & Related papers (2022-04-15T13:55:32Z)
- On the Robustness of Language Encoders against Grammatical Errors [66.05648604987479]
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data.
Results confirm that the performance of all tested models is affected but the degree of impact varies.
arXiv Detail & Related papers (2020-05-12T11:01:44Z)