A Simple and Practical Approach to Improve Misspellings in OCR Text
- URL: http://arxiv.org/abs/2106.12030v1
- Date: Tue, 22 Jun 2021 19:38:17 GMT
- Title: A Simple and Practical Approach to Improve Misspellings in OCR Text
- Authors: Junxia Lin (1), Johannes Ledolter (2) ((1) Georgetown University Medical Center, Georgetown University, (2) Tippie College of Business, University of Iowa)
- Abstract summary: This paper focuses on the identification and correction of non-word errors in OCR text.
Traditional N-gram correction methods can handle single-word errors effectively.
In this paper, we develop an unsupervised method that can handle split and merge errors.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The focus of our paper is the identification and correction of non-word
errors in OCR text. Such errors may be the result of incorrect insertion,
deletion, or substitution of a character, or the transposition of two adjacent
characters within a single word. They can also be the result of word boundary
problems that lead to run-on errors and incorrect-split errors. The traditional
N-gram correction methods can handle single-word errors effectively. However,
they show limitations when dealing with split and merge errors. In this paper,
we develop an unsupervised method that can handle both errors. The method we
develop leads to a sizable improvement in the correction rates. This tutorial
paper addresses very difficult word correction problems - namely incorrect
run-on and split errors - and illustrates what needs to be considered when
addressing such problems. We outline a possible approach and assess its success
on a limited study.
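The error classes named in the abstract can be illustrated with a minimal dictionary-based sketch. This is not the authors' method: the toy vocabulary, the function names (`single_edit_candidates`, `split_run_on`, `repair`), and the fixed order in which repairs are tried are all assumptions made for illustration. A practical system would rank candidates with N-gram context instead of taking the first match.

```python
from typing import List, Set

# Toy vocabulary (assumption for illustration; a real system needs a large lexicon).
VOCAB: Set[str] = {"the", "quick", "brown", "fox", "jumps", "over",
                   "lazy", "dog", "in", "to", "into"}
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def single_edit_candidates(word: str, vocab: Set[str]) -> Set[str]:
    """Vocabulary words reachable by one insertion, deletion, substitution,
    or transposition of two adjacent characters."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    substitutes = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    inserts = {a + c + b for a, b in splits for c in ALPHABET}
    return (deletes | transposes | substitutes | inserts) & vocab


def split_run_on(token: str, vocab: Set[str]) -> List[List[str]]:
    """All ways to split a run-on token into exactly two vocabulary words."""
    return [[token[:i], token[i:]] for i in range(1, len(token))
            if token[:i] in vocab and token[i:] in vocab]


def repair(tokens: List[str], vocab: Set[str]) -> List[str]:
    """Repair non-word tokens; words already in the vocabulary are left alone."""
    out: List[str] = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in vocab:
            out.append(tok)
            i += 1
        elif i + 1 < len(tokens) and tok + tokens[i + 1] in vocab:
            # Incorrect-split error: merging with the next token gives a word.
            out.append(tok + tokens[i + 1])
            i += 2
        elif split_run_on(tok, vocab):
            # Run-on error: take the first valid two-word split.
            out.extend(split_run_on(tok, vocab)[0])
            i += 1
        else:
            # Single-word error: alphabetical tie-break keeps this deterministic.
            candidates = single_edit_candidates(tok, vocab)
            out.append(min(candidates) if candidates else tok)
            i += 1
    return out


noisy = ["the", "qu", "ick", "brownfox", "jmups", "over", "teh", "lazy", "dog"]
print(" ".join(repair(noisy, VOCAB)))
```

In this sketch, `qu ick` is handled as an incorrect split, `brownfox` as a run-on, and `jmups` and `teh` as transpositions within a single word. Real-word errors (for example `in to` versus `into`) are deliberately left untouched, since the paper's focus is non-word errors.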
Related papers
- Subtle Errors Matter: Preference Learning via Error-injected Self-editing [59.405145971637204]
We propose a novel preference learning framework called eRror-Injected Self-Editing (RISE).
RISE injects predefined subtle errors into partial tokens of correct solutions to construct hard pairs for error mitigation.
Experiments validate the effectiveness of RISE, with preference learning on Qwen2-7B-Instruct yielding notable improvements of 3.0% on GSM8K and 7.9% on MATH.
arXiv Detail & Related papers (2024-10-09T07:43:38Z)
- A Coin Has Two Sides: A Novel Detector-Corrector Framework for Chinese Spelling Correction [79.52464132360618]
Chinese Spelling Correction (CSC) stands as a foundational Natural Language Processing (NLP) task.
We introduce a novel approach based on an error detector-corrector framework.
Our detector is designed to yield two error detection results, each characterized by high precision and recall.
arXiv Detail & Related papers (2024-09-06T09:26:45Z)
- Chinese Spelling Correction as Rephrasing Language Model [63.65217759957206]
We study Chinese Spelling Correction (CSC), which aims to detect and correct the potential spelling errors in a given sentence.
Current state-of-the-art methods regard CSC as a sequence tagging task and fine-tune BERT-based models on sentence pairs.
We propose Rephrasing Language Model (ReLM), where the model is trained to rephrase the entire sentence by infilling additional slots, instead of character-to-character tagging.
arXiv Detail & Related papers (2023-08-17T06:04:28Z)
- Correcting Real-Word Spelling Errors: A New Hybrid Approach [1.5469452301122175]
A new hybrid approach is proposed which relies on statistical and syntactic knowledge to detect and correct real-word errors.
The model can prove more practical than some other models, such as the WordNet-based method of Hirst and Budanitsky and the fixed-window-size method of Wilcox-O'Hearn and Hirst.
arXiv Detail & Related papers (2023-02-09T06:03:11Z)
- Real-Word Error Correction with Trigrams: Correcting Multiple Errors in a Sentence [0.0]
We propose a new variation which focuses on detecting and correcting multiple real-word errors in a sentence.
We test our approach on the Wall Street Journal corpus and show that it outperforms Hirst and Budanitsky's WordNet-based method and Wilcox-O'Hearn, Hirst, and Budanitsky's fixed-window-size method.
arXiv Detail & Related papers (2023-02-07T13:52:14Z)
- SoftCorrect: Error Correction with Soft Detection for Automatic Speech Recognition [116.31926128970585]
We propose SoftCorrect with a soft error detection mechanism to avoid the limitations of both explicit and implicit error detection.
Compared with implicit error detection with CTC loss, SoftCorrect provides an explicit signal about which words are incorrect.
Experiments on AISHELL-1 and Aidatatang datasets show that SoftCorrect achieves 26.1% and 9.4% CER reduction respectively.
arXiv Detail & Related papers (2022-12-02T09:11:32Z)
- Tokenization Repair in the Presence of Spelling Errors [0.2964978357715083]
Spelling errors can be present, but it's not part of the problem to correct them.
We identify three key ingredients of high-quality tokenization repair.
arXiv Detail & Related papers (2020-10-15T16:55:45Z)
- Improving the Efficiency of Grammatical Error Correction with Erroneous Span Detection and Correction [106.63733511672721]
We propose a novel language-independent approach to improve the efficiency of Grammatical Error Correction (GEC) by dividing the task into two subtasks: Erroneous Span Detection (ESD) and Erroneous Span Correction (ESC).
ESD identifies grammatically incorrect text spans with an efficient sequence tagging model. ESC leverages a seq2seq model to take the sentence with annotated erroneous spans as input and only outputs the corrected text for these spans.
Experiments show our approach performs comparably to conventional seq2seq approaches in both English and Chinese GEC benchmarks with less than 50% time cost for inference.
arXiv Detail & Related papers (2020-10-07T08:29:11Z)
- On the Robustness of Language Encoders against Grammatical Errors [66.05648604987479]
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data.
Results confirm that the performance of all tested models is affected but the degree of impact varies.
arXiv Detail & Related papers (2020-05-12T11:01:44Z)
- Error correction and extraction in request dialogs [12.137183622356197]
The component receives the last two utterances of a user and can detect whether the last utterance is an error correction of the second-to-last utterance.
It corrects the second-to-last utterance according to the error correction in the last utterance and outputs the extracted pairs of reparandum and repair entity.
One error correction detection approach and one error correction approach can be combined into a pipeline, or the error correction approaches can be trained and used end-to-end to avoid two components.
arXiv Detail & Related papers (2020-04-08T20:49:10Z)