An Assessment of the Impact of OCR Noise on Language Models
- URL: http://arxiv.org/abs/2202.00470v1
- Date: Wed, 26 Jan 2022 21:56:14 GMT
- Title: An Assessment of the Impact of OCR Noise on Language Models
- Authors: Konstantin Todorov and Giovanni Colavizza
- Abstract summary: We perform an assessment of the impact OCR noise has on a variety of language models, using data in Dutch, English, French and German.
We find that OCR noise poses a significant obstacle to language modelling, with language models increasingly diverging from their noiseless targets as OCR quality lowers.
- Score: 0.22843885788439797
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural language models are the backbone of modern-day natural language
processing applications. Their use on textual heritage collections which have
undergone Optical Character Recognition (OCR) is therefore also increasing.
Nevertheless, our understanding of the impact OCR noise could have on language
models is still limited. We perform an assessment of the impact OCR noise has
on a variety of language models, using data in Dutch, English, French and
German. We find that OCR noise poses a significant obstacle to language
modelling, with language models increasingly diverging from their noiseless
targets as OCR quality lowers. In the presence of small corpora, simpler models
including PPMI and Word2Vec consistently outperform transformer-based models in
this respect.
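As a toy illustration of the kind of comparison the abstract describes, the sketch below injects random character substitutions as a crude stand-in for OCR noise, trains Word2Vec on clean and noisy copies of a corpus, and measures how far each word's neighbourhood drifts. The corpus, noise model, and hyperparameters are illustrative assumptions, not the paper's setup.

```python
import random
from gensim.models import Word2Vec

random.seed(0)

# Crude stand-in for OCR noise: uniform random character substitutions.
# Real OCR errors are systematic (e.g. 'rn' misread as 'm'); this is a sketch.
def add_ocr_noise(sentence, char_error_rate=0.05):
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    return "".join(
        c if c == " " or random.random() > char_error_rate else random.choice(alphabet)
        for c in sentence
    )

# Toy corpus; the paper uses heritage collections in Dutch, English, French, German.
clean_corpus = ["the cat sat on the mat", "the dog sat on the rug"] * 500
noisy_corpus = [add_ocr_noise(s) for s in clean_corpus]

clean_model = Word2Vec([s.split() for s in clean_corpus], vector_size=50, min_count=1, seed=0)
noisy_model = Word2Vec([s.split() for s in noisy_corpus], vector_size=50, min_count=1, seed=0)

# Rotation-invariant divergence measure: overlap of each shared word's
# nearest neighbours in the clean vs. noisy embedding spaces.
def neighbour_overlap(word, k=5):
    clean_nn = {w for w, _ in clean_model.wv.most_similar(word, topn=k)}
    noisy_nn = {w for w, _ in noisy_model.wv.most_similar(word, topn=k)}
    return len(clean_nn & noisy_nn) / k

shared = set(clean_model.wv.key_to_index) & set(noisy_model.wv.key_to_index)
for word in sorted(shared):
    print(word, neighbour_overlap(word))
```

Raising char_error_rate should push the overlap toward zero, mirroring the divergence-with-noise trend the abstract reports.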
Related papers
- Spanish TrOCR: Leveraging Transfer Learning for Language Adaptation [0.0]
This study explores the transfer learning capabilities of the TrOCR architecture to Spanish.
We integrate an English TrOCR encoder with a language-specific decoder and train the model on that language.
Fine-tuning the English TrOCR on Spanish yields better recognition than the language-specific decoder for a fixed dataset size.
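A rough sketch of this encoder/decoder pairing with Hugging Face transformers is below; the checkpoints and wiring are assumptions for illustration (a generic ViT encoder plus a Spanish RoBERTa decoder), not the paper's code, which reuses the English TrOCR encoder.

```python
from transformers import AutoTokenizer, VisionEncoderDecoderModel

# Assumed checkpoints, for illustration only: a ViT image encoder paired
# with a Spanish RoBERTa decoder. The paper instead reuses the encoder of
# an English TrOCR checkpoint; the wiring is analogous.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",   # image encoder
    "PlanTL-GOB-ES/roberta-base-bne",      # language-specific (Spanish) decoder
)

tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/roberta-base-bne")
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# Fine-tune as usual for a VisionEncoderDecoderModel: feed pixel_values of
# text-line images and labels tokenized with the Spanish tokenizer.
```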
arXiv Detail & Related papers (2024-07-09T15:31:41Z)
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- Large Language Models are Efficient Learners of Noise-Robust Speech Recognition [65.95847272465124]
Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR).
In this work, we extend the benchmark to noisy conditions and investigate if we can teach LLMs to perform denoising for GER.
Experiments on various recent LLMs demonstrate that our approach achieves a new breakthrough, with up to 53.9% correction improvement in word error rate.
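A minimal sketch of prompt-based GER is below, assuming a toy N-best list and using a small open model as a stand-in for the instruction-tuned LLMs the paper evaluates; the prompt wording is an assumption.

```python
from transformers import pipeline

# Tiny stand-in for the large instruction-tuned models used in GER work.
llm = pipeline("text-generation", model="gpt2")

# Hypothetical noisy N-best list from an ASR system.
nbest = [
    "the whether today is cloudy with a chance of rain",
    "the weather to day is cloudy with the chance of rain",
    "the weather today is cloudy with a chance of rain",
]

# GER as prompting: map the N-best list to a single corrected transcript,
# letting the model denoise using evidence shared across hypotheses.
prompt = (
    "Candidate transcripts of one noisy utterance:\n"
    + "\n".join(f"{i + 1}. {hyp}" for i, hyp in enumerate(nbest))
    + "\nThe single corrected transcript is:"
)
print(llm(prompt, max_new_tokens=30)[0]["generated_text"])
```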
arXiv Detail & Related papers (2024-01-19T01:29:27Z)
- Ask Language Model to Clean Your Noisy Translation Data [7.246698449812031]
We focus on cleaning the noise from the target sentences in MTNT, making it more suitable as a benchmark for noise evaluation.
We show that large language models (LLMs) can effectively rephrase slang, jargon, and profanities.
Experiments on C-MTNT showcased its effectiveness in evaluating the robustness of NMT models.
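The sketch below shows the general shape of such LLM-based cleaning, assuming a toy noisy pair and a small stand-in model; only the target side is rewritten, so the pair stays parallel.

```python
from transformers import pipeline

llm = pipeline("text-generation", model="gpt2")  # stand-in for a real LLM

# Hypothetical noisy pair in the MTNT style: source kept as-is, target cleaned.
source = "ce film était vraiment nul"
target = "that movie was sooo trash lol"

prompt = (
    "Rewrite the sentence in standard English, replacing slang and "
    f"profanity while keeping the meaning:\n{target}\nRewritten:"
)
cleaned_target = llm(prompt, max_new_tokens=30)[0]["generated_text"]
cleaned_pair = (source, cleaned_target)  # source untouched, stays parallel
```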
arXiv Detail & Related papers (2023-10-20T13:05:32Z)
- Extrapolating Multilingual Understanding Models as Multilingual Generators [82.1355802012414]
This paper explores methods to endow multilingual understanding models with generation abilities, yielding a unified model.
We propose a Semantic-Guided Alignment-then-Denoising (SGA) approach to adapt an encoder to a multilingual generator with a small number of new parameters.
arXiv Detail & Related papers (2023-05-22T15:33:21Z)
- User-Centric Evaluation of OCR Systems for Kwak'wala [92.73847703011353]
We show that utilizing OCR reduces the time spent in the manual transcription of culturally valuable documents by over 50%.
Our results demonstrate the potential benefits that OCR tools can have on downstream language documentation and revitalization efforts.
arXiv Detail & Related papers (2023-02-26T21:41:15Z)
- Noisy Parallel Data Alignment [36.578851892373365]
We study the existing word-level alignment models under noisy settings and aim to make them more robust to noisy data.
Our noise simulation and structural biasing method, tested on multiple language pairs, reduces the alignment error rate of a state-of-the-art neural alignment model by up to 59.6%.
arXiv Detail & Related papers (2023-01-23T19:26:34Z)
- Cross-Lingual Cross-Modal Retrieval with Noise-Robust Learning [25.230786853723203]
We propose a noise-robust cross-lingual cross-modal retrieval method for low-resource languages.
We use Machine Translation to construct pseudo-parallel sentence pairs for low-resource languages.
We introduce a multi-view self-distillation method to learn noise-robust target-language representations.
arXiv Detail & Related papers (2022-08-26T09:32:24Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
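A minimal PyTorch sketch of an LSTM language model over phoneme-like units is below; the unit inventory, sizes, and training step are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Toy phoneme inventory; the paper's unit sets (syllables, phonemes) differ.
PHONEMES = ["<pad>", "<s>", "</s>", "K", "AE", "T", "S", "IH", "NG"]
stoi = {p: i for i, p in enumerate(PHONEMES)}

class PhonemeLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        hidden, _ = self.lstm(self.embed(x))
        return self.out(hidden)  # next-unit logits at every position

model = PhonemeLM(len(PHONEMES))
seq = torch.tensor([[stoi["<s>"], stoi["K"], stoi["AE"], stoi["T"], stoi["</s>"]]])

# One next-unit prediction step; sampling from a trained model of this kind
# yields the babbling-like sequences described above.
logits = model(seq[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, len(PHONEMES)), seq[:, 1:].reshape(-1)
)
loss.backward()
```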
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- How Phonotactics Affect Multilingual and Zero-shot ASR Performance [74.70048598292583]
A Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training.
We replace the encoder-decoder with a hybrid ASR system consisting of a separate AM and LM.
We show that the gain from modeling crosslingual phonotactics is limited, and that imposing too strong a model can hurt zero-shot transfer.
arXiv Detail & Related papers (2020-10-22T23:07:24Z)