An Unsupervised method for OCR Post-Correction and Spelling
Normalisation for Finnish
- URL: http://arxiv.org/abs/2011.03502v1
- Date: Fri, 6 Nov 2020 18:19:48 GMT
- Title: An Unsupervised method for OCR Post-Correction and Spelling
Normalisation for Finnish
- Authors: Quan Duong, Mika H\"am\"al\"ainen, Simon Hengchen
- Abstract summary: Historical corpora are known to contain errors introduced by OCR (optical character recognition) methods.
We build on previous work on fully automatic unsupervised extraction of parallel data to train a character-based sequence-to-sequence NMT (neural machine translation) model.
Our new method shows increased performance while remaining fully unsupervised, with the added benefit of spelling normalisation.
- Score: 1.0957528713294875
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Historical corpora are known to contain errors introduced by OCR (optical
character recognition) methods used in the digitization process, often said to
be degrading the performance of NLP systems. Correcting these errors manually
is a time-consuming process and a great part of the automatic approaches have
been relying on rules or supervised machine learning. We build on previous work
on fully automatic unsupervised extraction of parallel data to train a
character-based sequence-to-sequence NMT (neural machine translation) model to
conduct OCR error correction designed for English, and adapt it to Finnish by
proposing solutions that take the rich morphology of the language into account.
Our new method shows increased performance while remaining fully unsupervised,
with the added benefit of spelling normalisation. The source code and models
are available on GitHub and Zenodo.
Related papers
- LM-Combiner: A Contextual Rewriting Model for Chinese Grammatical Error Correction [49.0746090186582]
Over-correction is a critical problem in Chinese grammatical error correction (CGEC) task.
Recent work using model ensemble methods can effectively mitigate over-correction and improve the precision of the GEC system.
We propose the LM-Combiner, a rewriting model that can directly modify the over-correction of GEC system outputs without a model ensemble.
arXiv Detail & Related papers (2024-03-26T06:12:21Z) - A Novel Approach for Automatic Program Repair using Round-Trip
Translation with Large Language Models [50.86686630756207]
Research shows that grammatical mistakes in a sentence can be corrected by translating it to another language and back.
Current generative models for Automatic Program Repair (APR) are pre-trained on source code and fine-tuned for repair.
This paper proposes bypassing the fine-tuning step and using Round-Trip Translation (RTT): translation of code from one programming language to another programming or natural language, and back.
arXiv Detail & Related papers (2024-01-15T22:36:31Z) - Chinese Spelling Correction as Rephrasing Language Model [63.65217759957206]
We study Chinese Spelling Correction (CSC), which aims to detect and correct the potential spelling errors in a given sentence.
Current state-of-the-art methods regard CSC as a sequence tagging task and fine-tune BERT-based models on sentence pairs.
We propose Rephrasing Language Model (ReLM), where the model is trained to rephrase the entire sentence by infilling additional slots, instead of character-to-character tagging.
arXiv Detail & Related papers (2023-08-17T06:04:28Z) - An Error-Guided Correction Model for Chinese Spelling Error Correction [13.56600372085612]
We propose an error-guided correction model (EGCM) to improve Chinese spelling correction.
Our model achieves superior performance against state-of-the-art approaches by a remarkable margin.
arXiv Detail & Related papers (2023-01-16T09:27:45Z) - Generating Sequences by Learning to Self-Correct [64.0249217590888]
Self-Correction decouples an imperfect base generator from a separate corrector that learns to iteratively correct imperfect generations.
We show that Self-Correction improves upon the base generator in three diverse generation tasks.
arXiv Detail & Related papers (2022-10-31T18:09:51Z) - uChecker: Masked Pretrained Language Models as Unsupervised Chinese
Spelling Checkers [23.343006562849126]
We propose a framework named textbfuChecker to conduct unsupervised spelling error detection and correction.
Masked pretrained language models such as BERT are introduced as the backbone model.
Benefiting from the various and flexible MASKing operations, we propose a Confusionset-guided masking strategy to fine-train the masked language model.
arXiv Detail & Related papers (2022-09-15T05:57:12Z) - Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z) - Non-Parametric Online Learning from Human Feedback for Neural Machine
Translation [54.96594148572804]
We study the problem of online learning with human feedback in the human-in-the-loop machine translation.
Previous methods require online model updating or additional translation memory networks to achieve high-quality performance.
We propose a novel non-parametric online learning method without changing the model structure.
arXiv Detail & Related papers (2021-09-23T04:26:15Z) - End-to-End Lexically Constrained Machine Translation for Morphologically
Rich Languages [0.0]
We investigate mechanisms to allow neural machine translation to infer the correct word inflection given lemmatized constraints.
Our experiments on the English-Czech language pair show that this approach improves the translation of constrained terms in both automatic and manual evaluation.
arXiv Detail & Related papers (2021-06-23T13:40:13Z) - Empirical Error Modeling Improves Robustness of Noisy Neural Sequence
Labeling [26.27504889360246]
We propose an empirical error generation approach that employs a sequence-to-sequence model trained to perform translation from error-free to erroneous text.
To overcome the data sparsity problem that exacerbates in the case of imperfect textual input, we learned noisy language model-based embeddings.
Our approach outperformed the baseline noise generation and error correction techniques on the erroneous sequence labeling data sets.
arXiv Detail & Related papers (2021-05-25T12:15:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.