Tokenization Repair in the Presence of Spelling Errors
- URL: http://arxiv.org/abs/2010.07878v2
- Date: Wed, 23 Mar 2022 14:24:16 GMT
- Title: Tokenization Repair in the Presence of Spelling Errors
- Authors: Hannah Bast, Matthias Hertel, Mostafa M. Mohamed
- Abstract summary: Spelling errors may be present, but correcting them is not part of the problem.
We identify three key ingredients of high-quality tokenization repair.
- Score: 0.2964978357715083
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the following tokenization repair problem: Given a natural
language text with any combination of missing or spurious spaces, correct
these. Spelling errors can be present, but it's not part of the problem to
correct them. For example, given: "Tispa per isabout token izaionrep air",
compute "Tis paper is about tokenizaion repair". We identify three key
ingredients of high-quality tokenization repair, all missing from previous
work: deep language models with a bidirectional component, training the models
on text with spelling errors, and making use of the space information already
present. Our methods also improve existing spell checkers by fixing not only
more tokenization errors but also more spelling errors: once it is clear which
characters form a word, it is much easier for them to figure out the correct
word. We provide six benchmarks that cover three use cases (OCR errors, text
extraction from PDF, human errors) and the cases of partially correct space
information and all spaces missing. We evaluate our methods against the best
existing methods and a non-trivial baseline. We provide full reproducibility
under https://ad.cs.uni-freiburg.de/publications .
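The task can be made concrete with a toy baseline (not the paper's deep-LM method): strip all spaces, then re-insert them via dynamic programming over a dictionary, minimizing the number of out-of-vocabulary characters. The `vocab` word list below is a hypothetical toy dictionary for illustration; the sketch also shows why spelling errors make the problem harder, since a misspelled word matches nothing in the dictionary.

```python
# Minimal word-segmentation baseline for tokenization repair, assuming a
# known vocabulary. This is NOT the paper's method, only an illustration
# of the problem setting.
def repair_spaces(text, vocab, max_word_len=20):
    s = text.replace(" ", "")  # discard all (possibly wrong) spaces
    n = len(s)
    INF = float("inf")
    # cost[i]: minimal number of out-of-vocabulary characters in s[:i]
    cost = [INF] * (n + 1)
    back = [0] * (n + 1)
    cost[0] = 0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            word = s[j:i]
            # in-vocab words are free; otherwise pay one unit per character
            c = cost[j] + (0 if word in vocab else len(word))
            if c < cost[i]:
                cost[i], back[i] = c, j
    # backtrack to recover the best segmentation
    words, i = [], n
    while i > 0:
        words.append(s[back[i]:i])
        i = back[i]
    return " ".join(reversed(words))
```

On cleanly spelled input this baseline recovers the intended spaces, but a misspelling like "tokenizaion" defeats it, which is one motivation for training the paper's language models on text that already contains spelling errors.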
Related papers
- Subtle Errors Matter: Preference Learning via Error-injected Self-editing [59.405145971637204]
We propose a novel preference learning framework called eRror-Injected Self-Editing (RISE)
RISE injects predefined subtle errors into partial tokens of correct solutions to construct hard pairs for error mitigation.
Experiments validate the effectiveness of RISE, with preference learning on Qwen2-7B-Instruct yielding notable improvements of 3.0% on GSM8K and 7.9% on MATH.
arXiv Detail & Related papers (2024-10-09T07:43:38Z)
- A Coin Has Two Sides: A Novel Detector-Corrector Framework for Chinese Spelling Correction [79.52464132360618]
Chinese Spelling Correction (CSC) stands as a foundational Natural Language Processing (NLP) task.
We introduce a novel approach based on an error detector-corrector framework.
Our detector is designed to yield two error detection results, each characterized by high precision and recall.
arXiv Detail & Related papers (2024-09-06T09:26:45Z)
- Chinese Spelling Correction as Rephrasing Language Model [63.65217759957206]
We study Chinese Spelling Correction (CSC), which aims to detect and correct the potential spelling errors in a given sentence.
Current state-of-the-art methods regard CSC as a sequence tagging task and fine-tune BERT-based models on sentence pairs.
We propose Rephrasing Language Model (ReLM), where the model is trained to rephrase the entire sentence by infilling additional slots, instead of character-to-character tagging.
arXiv Detail & Related papers (2023-08-17T06:04:28Z)
- Byte-Level Grammatical Error Correction Using Synthetic and Curated Corpora [0.0]
Grammatical error correction (GEC) is the task of correcting typos, spelling, punctuation and grammatical issues in text.
We show that a byte-level model enables higher correction quality than a subword approach.
arXiv Detail & Related papers (2023-05-29T06:35:40Z)
- SoftCorrect: Error Correction with Soft Detection for Automatic Speech Recognition [116.31926128970585]
We propose SoftCorrect with a soft error detection mechanism to avoid the limitations of both explicit and implicit error detection.
Compared with implicit error detection with CTC loss, SoftCorrect provides explicit signal about which words are incorrect.
Experiments on AISHELL-1 and Aidatatang datasets show that SoftCorrect achieves 26.1% and 9.4% CER reduction respectively.
arXiv Detail & Related papers (2022-12-02T09:11:32Z)
- Mask the Correct Tokens: An Embarrassingly Simple Approach for Error Correction [38.463639262607174]
Previous error correction methods usually take the source (incorrect) sentence as encoder input and generate the target (correct) sentence through the decoder.
We propose a simple yet effective masking strategy to achieve this goal.
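The masking idea can be sketched as follows, under the assumption that "correct" source tokens are those already matching the target; the function name, mask symbol, and masking rate are illustrative, not the paper's exact recipe.

```python
import random

def mask_correct_tokens(src, tgt, rate=1.0, mask="[MASK]", rng=None):
    # Hide a fraction of the already-correct source tokens so the model
    # cannot simply copy them and must predict from context instead.
    rng = rng or random.Random(0)
    out = []
    for s, t in zip(src, tgt):
        if s == t and rng.random() < rate:
            out.append(mask)  # correct token: mask it
        else:
            out.append(s)     # erroneous token: keep it visible
    return out
```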
arXiv Detail & Related papers (2022-11-23T19:05:48Z)
- Correcting Arabic Soft Spelling Mistakes using BiLSTM-based Machine Learning [1.7205106391379026]
Soft spelling errors are widespread among native Arabic speakers and foreign learners alike.
We develop, train, evaluate, and compare a set of BiLSTM networks to correct this class of errors.
The best model corrects 96.4% of the injected errors and achieves a low character error rate of 1.28% on a real test set of soft spelling mistakes.
arXiv Detail & Related papers (2021-08-02T19:47:55Z)
- A Simple and Practical Approach to Improve Misspellings in OCR Text [0.0]
This paper focuses on the identification and correction of non-word errors in OCR text.
Traditional N-gram correction methods can handle single-word errors effectively.
In this paper, we develop an unsupervised method that can handle split and merge errors.
arXiv Detail & Related papers (2021-06-22T19:38:17Z)
- NeuSpell: A Neural Spelling Correction Toolkit [88.79419580807519]
NeuSpell is an open-source toolkit for spelling correction in English.
It comprises ten different models and benchmarks them on misspellings from multiple sources.
We train neural models using spelling errors in context, synthetically constructed by reverse engineering isolated misspellings.
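Constructing training data this way can be sketched as below; the `CONFUSIONS` lookup and function name are hypothetical stand-ins for NeuSpell's actual harvested misspelling tables.

```python
import random

# Illustrative table of known misspellings per word; a real system would
# harvest such a lookup from large collections of isolated misspellings.
CONFUSIONS = {"receive": ["recieve"], "separate": ["seperate"]}

def inject_errors(sentence, rate=1.0, rng=None):
    # Replace words in a clean sentence with known misspellings so the
    # resulting pairs can train a corrector on errors in context.
    rng = rng or random.Random(0)
    out = []
    for word in sentence.split():
        variants = CONFUSIONS.get(word.lower())
        if variants and rng.random() < rate:
            out.append(rng.choice(variants))  # swap in a misspelling
        else:
            out.append(word)                  # leave the word untouched
    return " ".join(out)
```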
arXiv Detail & Related papers (2020-10-21T15:53:29Z)
- Domain-shift Conditioning using Adaptable Filtering via Hierarchical Embeddings for Robust Chinese Spell Check [29.041134293160255]
Spell check is a useful application which processes noisy human-generated text.
For Chinese spell check, filtering using confusion sets narrows the search space and makes finding corrections easier.
We propose a scalable adaptable filter that exploits hierarchical character embeddings to obviate the need to handcraft confusion sets.
arXiv Detail & Related papers (2020-08-27T17:34:40Z)
- On the Robustness of Language Encoders against Grammatical Errors [66.05648604987479]
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data.
Results confirm that the performance of all tested models is affected but the degree of impact varies.
arXiv Detail & Related papers (2020-05-12T11:01:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.