Related papers: An Unsupervised method for OCR Post-Correction and Spelling Normalisation for Finnish

URL: http://arxiv.org/abs/2011.03502v1
Date: Fri, 6 Nov 2020 18:19:48 GMT
Title: An Unsupervised method for OCR Post-Correction and Spelling Normalisation for Finnish
Authors: Quan Duong, Mika Hämäläinen, Simon Hengchen
Abstract summary: Historical corpora are known to contain errors introduced by OCR (optical character recognition) methods. We build on previous work on fully automatic unsupervised extraction of parallel data to train a character-based sequence-to-sequence NMT (neural machine translation) model. Our new method shows increased performance while remaining fully unsupervised, with the added benefit of spelling normalisation.
Score: 1.0957528713294875
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Historical corpora are known to contain errors introduced by OCR (optical character recognition) methods used in the digitization process, often said to be degrading the performance of NLP systems. Correcting these errors manually is a time-consuming process and a great part of the automatic approaches have been relying on rules or supervised machine learning. We build on previous work on fully automatic unsupervised extraction of parallel data to train a character-based sequence-to-sequence NMT (neural machine translation) model to conduct OCR error correction designed for English, and adapt it to Finnish by proposing solutions that take the rich morphology of the language into account. Our new method shows increased performance while remaining fully unsupervised, with the added benefit of spelling normalisation. The source code and models are available on GitHub and Zenodo.

Related papers

TranslationCorrect: A Unified Framework for Machine Translation Post-Editing with Predictive Error Assistance [5.306276499628096]
Machine translation (MT) post-editing and research data collection often rely on inefficient translation, disconnected. We introduce TranslationCorrect, an integrated framework designed to streamline these tasks. It combines MT generation using models like NLLB, automated error prediction using models like XCOMET or LLM APIs (providing detailed reasoning), and an intuitive post-editing interface within a single environment.
arXiv Detail & Related papers (2025-06-23T06:38:49Z)
Design of intelligent proofreading system for English translation based on CNN and BERT [5.498056383808144]
This paper proposes a novel hybrid approach for robust proofreading. It combines convolutional neural networks (CNN) with Bidirectional Encoder Representations from Transformers (BERT). Experiments attain a 90% accuracy, 89.37% F1, and 16.24% MSE, exceeding recent proofreading techniques by over 10% overall.
arXiv Detail & Related papers (2025-06-05T09:34:42Z)
Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
Evaluating a constraint on every token can be prohibitively expensive. LCD can distort the global distribution over strings, sampling tokens based only on local information. We show that our approach is superior to state-of-the-art baselines.
arXiv Detail & Related papers (2025-04-07T18:30:18Z)
RoundTripOCR: A Data Generation Technique for Enhancing Post-OCR Error Correction in Low-Resource Devanagari Languages [41.09752906121257]
We propose an approach for synthetic data generation for Devanagari languages, RoundTripOCR. We release post-OCR text correction datasets for Hindi, Marathi, Bodo, Nepali, Konkani and Sanskrit. We also present a novel approach for OCR error correction by leveraging techniques from machine translation.
arXiv Detail & Related papers (2024-12-14T19:59:41Z)
LM-Combiner: A Contextual Rewriting Model for Chinese Grammatical Error Correction [49.0746090186582]
Over-correction is a critical problem in the Chinese grammatical error correction (CGEC) task. Recent work using model ensemble methods can effectively mitigate over-correction and improve the precision of the GEC system. We propose the LM-Combiner, a rewriting model that can directly modify the over-correction of GEC system outputs without a model ensemble.
arXiv Detail & Related papers (2024-03-26T06:12:21Z)
A Novel Approach for Automatic Program Repair using Round-Trip Translation with Large Language Models [50.86686630756207]
Research shows that grammatical mistakes in a sentence can be corrected by translating it to another language and back. Current generative models for Automatic Program Repair (APR) are pre-trained on source code and fine-tuned for repair. This paper proposes bypassing the fine-tuning step and using Round-Trip Translation (RTT): translation of code from one programming language to another programming or natural language, and back.
arXiv Detail & Related papers (2024-01-15T22:36:31Z)
Chinese Spelling Correction as Rephrasing Language Model [63.65217759957206]
We study Chinese Spelling Correction (CSC), which aims to detect and correct the potential spelling errors in a given sentence. Current state-of-the-art methods regard CSC as a sequence tagging task and fine-tune BERT-based models on sentence pairs. We propose Rephrasing Language Model (ReLM), where the model is trained to rephrase the entire sentence by infilling additional slots, instead of character-to-character tagging.
arXiv Detail & Related papers (2023-08-17T06:04:28Z)
An Error-Guided Correction Model for Chinese Spelling Error Correction [13.56600372085612]
We propose an error-guided correction model (EGCM) to improve Chinese spelling correction. Our model achieves superior performance against state-of-the-art approaches by a remarkable margin.
arXiv Detail & Related papers (2023-01-16T09:27:45Z)
Generating Sequences by Learning to Self-Correct [64.0249217590888]
Self-Correction decouples an imperfect base generator from a separate corrector that learns to iteratively correct imperfect generations. We show that Self-Correction improves upon the base generator in three diverse generation tasks.
arXiv Detail & Related papers (2022-10-31T18:09:51Z)
uChecker: Masked Pretrained Language Models as Unsupervised Chinese Spelling Checkers [23.343006562849126]
We propose a framework named uChecker to conduct unsupervised spelling error detection and correction. Masked pretrained language models such as BERT are introduced as the backbone model. Benefiting from the various and flexible MASKing operations, we propose a Confusionset-guided masking strategy to fine-train the masked language model.
arXiv Detail & Related papers (2022-09-15T05:57:12Z)
Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents. Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages. We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z)
Non-Parametric Online Learning from Human Feedback for Neural Machine Translation [54.96594148572804]
We study the problem of online learning with human feedback in human-in-the-loop machine translation. Previous methods require online model updating or additional translation memory networks to achieve high-quality performance. We propose a novel non-parametric online learning method without changing the model structure.
arXiv Detail & Related papers (2021-09-23T04:26:15Z)
End-to-End Lexically Constrained Machine Translation for Morphologically Rich Languages [0.0]
We investigate mechanisms to allow neural machine translation to infer the correct word inflection given lemmatized constraints. Our experiments on the English-Czech language pair show that this approach improves the translation of constrained terms in both automatic and manual evaluation.
arXiv Detail & Related papers (2021-06-23T13:40:13Z)
Empirical Error Modeling Improves Robustness of Noisy Neural Sequence Labeling [26.27504889360246]
We propose an empirical error generation approach that employs a sequence-to-sequence model trained to perform translation from error-free to erroneous text. To overcome the data sparsity problem that is exacerbated in the case of imperfect textual input, we learned noisy language model-based embeddings. Our approach outperformed the baseline noise generation and error correction techniques on the erroneous sequence labeling data sets.
arXiv Detail & Related papers (2021-05-25T12:15:45Z)

This list is automatically generated from the titles and abstracts of the papers on this site.