Comparative analysis of optical character recognition methods for Sámi texts from the National Library of Norway
- URL: http://arxiv.org/abs/2501.07300v1
- Date: Mon, 13 Jan 2025 13:07:51 GMT
- Title: Comparative analysis of optical character recognition methods for Sámi texts from the National Library of Norway
- Authors: Tita Enstad, Trond Trosterud, Marie Iversdatter Røsok, Yngvil Beyer, Marie Roald
- Abstract summary: We evaluate and improve OCR for text written in Sámi languages.
Our results show that Transkribus and TrOCR outperform Tesseract on this task.
We also show that fine-tuning pre-trained models and supplementing manual annotations can yield accurate OCR for Sámi languages.
- Abstract: Optical Character Recognition (OCR) is crucial to the National Library of Norway's (NLN) digitisation process as it converts scanned documents into machine-readable text. However, for the Sámi documents in NLN's collection, the OCR accuracy is insufficient. Given that OCR quality affects downstream processes, evaluating and improving OCR for text written in Sámi languages is necessary to make these resources accessible. To address this need, this work fine-tunes and evaluates three established OCR approaches, Transkribus, Tesseract and TrOCR, for transcribing Sámi texts from NLN's collection. Our results show that Transkribus and TrOCR outperform Tesseract on this task, while Tesseract achieves superior performance on an out-of-domain dataset. Furthermore, we show that fine-tuning pre-trained models and supplementing manual annotations with machine annotations and synthetic text images can yield accurate OCR for Sámi languages, even with a moderate amount of manually annotated data.
Related papers
- Deciphering the Underserved: Benchmarking LLM OCR for Low-Resource Scripts [0.0]
This study investigates the potential of Large Language Models (LLMs), particularly GPT-4o, for Optical Character Recognition (OCR) in low-resource scripts such as Urdu, Albanian, and Tajik.
Using a meticulously curated dataset of 2,520 images incorporating controlled variations in text length, font size, background color, and blur, the research simulates diverse real-world challenges.
arXiv Detail & Related papers (2024-12-20T18:05:22Z) - RoundTripOCR: A Data Generation Technique for Enhancing Post-OCR Error Correction in Low-Resource Devanagari Languages [41.09752906121257]
We propose an approach for synthetic data generation for Devanagari languages, RoundTripOCR.
We release post-OCR text correction datasets for Hindi, Marathi, Bodo, Nepali, Konkani and Sanskrit.
We also present a novel approach for OCR error correction by leveraging techniques from machine translation.
arXiv Detail & Related papers (2024-12-14T19:59:41Z) - LOCR: Location-Guided Transformer for Optical Character Recognition [55.195165959662795]
We propose LOCR, a model that integrates location guiding into the transformer architecture during autoregression.
We train the model on a dataset comprising over 77M text-location pairs from 125K academic document pages, including bounding boxes for words, tables and mathematical symbols.
It outperforms all existing methods in our test set constructed from arXiv, as measured by edit distance, BLEU, METEOR and F-measure.
arXiv Detail & Related papers (2024-03-04T15:34:12Z) - EfficientOCR: An Extensible, Open-Source Package for Efficiently
Digitizing World Knowledge [1.8434042562191815]
EffOCR is a novel open-source optical character recognition (OCR) package.
It meets both the computational and sample efficiency requirements for liberating texts at scale.
EffOCR is cheap and sample efficient to train, as the model only needs to learn characters' visual appearance and not how they are used in sequence to form language.
arXiv Detail & Related papers (2023-10-16T04:20:16Z) - TransDocs: Optical Character Recognition with word to word translation [2.2336243882030025]
This research work focuses on improving optical character recognition (OCR) with ML techniques.
This work is based on the ANKI dataset for English to Spanish translation.
arXiv Detail & Related papers (2023-04-15T21:40:14Z) - User-Centric Evaluation of OCR Systems for Kwak'wala [92.73847703011353]
We show that utilizing OCR reduces the time spent in the manual transcription of culturally valuable documents by over 50%.
Our results demonstrate the potential benefits that OCR tools can have on downstream language documentation and revitalization efforts.
arXiv Detail & Related papers (2023-02-26T21:41:15Z) - Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z) - TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models [47.48019831416665]
We propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR.
TrOCR is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets.
Experiments show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks.
arXiv Detail & Related papers (2021-09-21T16:01:56Z) - TextOCR: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text [23.04601165885908]
We propose TextOCR, an arbitrary-shaped scene text detection and recognition dataset with 900k annotated words collected on real images.
We show that current state-of-the-art text-recognition (OCR) models fail to perform well on TextOCR.
We use a TextOCR-trained OCR model to create the PixelM4C model, which can perform scene-text-based reasoning on an image in an end-to-end fashion.
arXiv Detail & Related papers (2021-05-12T07:50:42Z) - OCR Post Correction for Endangered Language Texts [113.8242302688894]
We create a benchmark dataset of transcriptions for scanned books in three critically endangered languages.
We present a systematic analysis of how general-purpose OCR tools are not robust to the data-scarce setting.
We develop an OCR post-correction method tailored to ease training in this data-scarce setting.
arXiv Detail & Related papers (2020-11-10T21:21:08Z) - Scene Text Image Super-Resolution in the Wild [112.90416737357141]
Low-resolution text images are often seen in natural scenes such as documents captured by mobile phones.
Previous single image super-resolution (SISR) methods are trained on synthetic low-resolution images.
We propose a real scene text SR dataset, termed TextZoom.
It contains paired real low-resolution and high-resolution images captured by cameras with different focal lengths in the wild.
arXiv Detail & Related papers (2020-05-07T09:18:59Z)