DISGO: Automatic End-to-End Evaluation for Scene Text OCR
- URL: http://arxiv.org/abs/2308.13173v1
- Date: Fri, 25 Aug 2023 04:45:37 GMT
- Title: DISGO: Automatic End-to-End Evaluation for Scene Text OCR
- Authors: Mei-Yuh Hwang, Yangyang Shi, Ankit Ramchandani, Guan Pang, Praveen
Krishnan, Lucas Kabela, Frank Seide, Samyak Datta, Jun Liu
- Abstract summary: We propose to uniformly use word error rates (WER) as a new measurement for evaluating scene-text OCR.
Particularly for the e2e metric, we name it DISGO WER as it considers Deletion, Insertion, Substitution, and Grouping/Ordering errors.
- Score: 16.231114992450895
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper discusses the challenges of optical character recognition (OCR) on
natural scenes, which is harder than OCR on documents due to unconstrained content
and varied image backgrounds. We propose to uniformly use word error rate
(WER) as a new measurement for evaluating scene-text OCR, covering both end-to-end (e2e)
performance and the performance of individual system components. For the
e2e metric in particular, we name it DISGO WER, as it considers Deletion, Insertion,
Substitution, and Grouping/Ordering errors. Finally, we propose to utilize the
concept of super blocks to automatically compute BLEU scores for e2e OCR
machine translation. The small SCUT public test set is used to demonstrate the WER
performance of a modularized OCR system.
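To make the metric concrete, below is a minimal sketch of how an e2e WER in the spirit of DISGO might be computed once the OCR blocks have been serialized into a single word stream, followed by a plain sentence-level BLEU as it might be scored within one super block. The function names and the smoothing choice are illustrative assumptions, not the paper's implementation; the paper's actual super-block grouping is not reproduced here. Note how, after serialization, grouping/ordering mistakes surface as extra insertion and deletion errors.

```python
# Minimal sketch (not the paper's implementation): word error rate with
# separate deletion / insertion / substitution counts, plus a plain BLEU
# score as it might be applied within one super block. Requires nltk.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def wer_counts(reference: str, hypothesis: str):
    """Levenshtein alignment over words; returns (dels, ins, subs, wer)."""
    ref, hyp = reference.split(), hypothesis.split()
    m, n = len(ref), len(hyp)
    # dp[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # i deletions
    for j in range(n + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),
                           dp[i - 1][j] + 1,   # delete a reference word
                           dp[i][j - 1] + 1)   # insert a hypothesis word
    # Backtrack to attribute each edit to deletion, insertion, or substitution.
    dels = ins = subs = 0
    i, j = m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and ref[i - 1] == hyp[j - 1]
                and dp[i][j] == dp[i - 1][j - 1]):
            i, j = i - 1, j - 1           # match, no error
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            subs += 1
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return dels, ins, subs, dp[m][n] / max(m, 1)

# Toy example: a mis-ordered block surfaces as insertion + deletion errors.
ref = "open 24 hours pizza palace"
hyp = "pizza palace open 24 hours"
d, i, s, w = wer_counts(ref, hyp)
print(f"D={d} I={i} S={s} WER={w:.2f}")

# Plain sentence-level BLEU, as it might be scored inside one super block.
bleu = sentence_bleu([ref.split()], hyp.split(),
                     smoothing_function=SmoothingFunction().method1)
print(f"BLEU={bleu:.3f}")
```

On the toy pair above, the alignment yields a WER of 0.80 even though every word is correct, which is exactly the kind of grouping/ordering penalty the DISGO formulation is designed to capture.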
Related papers
- PreP-OCR: A Complete Pipeline for Document Image Restoration and Enhanced OCR Accuracy [14.50674472785442]
PreP-OCR is a two-stage pipeline that combines document image restoration with semantic-aware post-OCR correction.
We show that PreP-OCR reduces character error rates by 63.9-70.3% compared to OCR on raw images.
arXiv Detail & Related papers (2025-05-26T18:25:28Z)
- Lost in OCR Translation? Vision-Based Approaches to Robust Document Retrieval [38.569818461453394]
Retrieval-Augmented Generation (RAG) is a technique for grounding responses in external documents.
Traditional RAG systems rely on Optical Character Recognition (OCR) to first process scanned documents into text.
Recent vision-language approaches, such as ColPali, propose direct visual embedding of documents, eliminating the need for OCR.
arXiv Detail & Related papers (2025-05-08T21:54:02Z)
- TFIC: End-to-End Text-Focused Image Compression for Coding for Machines [50.86328069558113]
We present an image compression system designed to retain text-specific features for subsequent Optical Character Recognition (OCR).
Our encoding process requires half the time needed by the OCR module, making it especially suitable for devices with limited computational capacity.
arXiv Detail & Related papers (2025-03-25T09:36:13Z)
- Comparative analysis of optical character recognition methods for Sámi texts from the National Library of Norway [0.2796197251957244]
We evaluate and improve OCR for text written in Sámi languages.
Our results show that Transkribus and TrOCR outperform Tesseract on this task.
We also show that fine-tuning pre-trained models and supplementing manual annotations can yield accurate OCR for Sámi languages.
arXiv Detail & Related papers (2025-01-13T13:07:51Z)
- CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy [50.78228433498211]
CC-OCR comprises four OCR-centric tracks: multi-scene text reading, multilingual text reading, document parsing, and key information extraction.
It includes 39 subsets with 7,058 fully annotated images, of which 41% are sourced from real applications and released for the first time.
We evaluate nine prominent LMMs and reveal both the strengths and weaknesses of these models, particularly in text grounding, multi-orientation, and hallucination of repetition.
arXiv Detail & Related papers (2024-12-03T07:03:25Z)
- DLoRA-TrOCR: Mixed Text Mode Optical Character Recognition Based On Transformer [12.966765239586994]
Multiple fonts, mixed scenes, and complex layouts seriously degrade the recognition accuracy of traditional OCR models.
We propose a parameter-efficient mixed text recognition method based on a pre-trained OCR Transformer, named DLoRA-TrOCR.
arXiv Detail & Related papers (2024-04-19T09:28:16Z)
- Adversarial Training with OCR Modality Perturbation for Scene-Text Visual Question Answering [8.382903851560595]
Scene-Text Visual Question Answering (ST-VQA) aims to understand scene text in images and answer questions related to the text content.
Most existing methods heavily rely on the accuracy of Optical Character Recognition (OCR) systems.
We propose a multimodal adversarial training architecture with spatial awareness capabilities.
arXiv Detail & Related papers (2024-03-14T11:22:06Z)
- EfficientOCR: An Extensible, Open-Source Package for Efficiently Digitizing World Knowledge [1.8434042562191815]
EffOCR is a novel open-source optical character recognition (OCR) package.
It meets both the computational and sample efficiency requirements for liberating texts at scale.
EffOCR is cheap and sample efficient to train, as the model only needs to learn characters' visual appearance and not how they are used in sequence to form language.
arXiv Detail & Related papers (2023-10-16T04:20:16Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- User-Centric Evaluation of OCR Systems for Kwak'wala [92.73847703011353]
We show that utilizing OCR reduces the time spent in the manual transcription of culturally valuable documents by over 50%.
Our results demonstrate the potential benefits that OCR tools can have on downstream language documentation and revitalization efforts.
arXiv Detail & Related papers (2023-02-26T21:41:15Z)
- End-to-End Page-Level Assessment of Handwritten Text Recognition [69.55992406968495]
Handwritten text recognition (HTR) systems increasingly face the end-to-end page-level transcription of a document.
Standard metrics do not take into account the inconsistencies that can appear at this level, such as errors in reading order (RO).
We propose a two-fold evaluation, where transcription accuracy and RO goodness are considered separately.
arXiv Detail & Related papers (2023-01-14T15:43:07Z)
- PreSTU: Pre-Training for Scene-Text Understanding [49.288302725486226]
We propose PreSTU, a novel pre-training recipe dedicated to scene-text understanding (STU).
PreSTU introduces OCR-aware pre-training objectives that encourage the model to recognize text from an image and connect it to the rest of the image content.
We empirically demonstrate the effectiveness of this pre-training approach on eight visual question answering and four image captioning benchmarks.
arXiv Detail & Related papers (2022-09-12T18:29:55Z)
- Donut: Document Understanding Transformer without OCR [17.397447819420695]
We propose a novel visual document understanding (VDU) model that is end-to-end trainable without relying on an underlying OCR framework.
Our approach achieves state-of-the-art performance on various document understanding tasks in public benchmark datasets and private industrial service datasets.
arXiv Detail & Related papers (2021-11-30T18:55:19Z)
- TextOCR: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text [23.04601165885908]
We propose TextOCR, an arbitrary-shaped scene-text detection and recognition dataset with 900k annotated words collected from real images.
We show that current state-of-the-art text-recognition (OCR) models fail to perform well on TextOCR.
We use a TextOCR-trained OCR model to create the PixelM4C model, which can perform scene-text-based reasoning on an image in an end-to-end fashion.
arXiv Detail & Related papers (2021-05-12T07:50:42Z)
- Scene Text Image Super-Resolution in the Wild [112.90416737357141]
Low-resolution text images are often seen in natural scenes such as documents captured by mobile phones.
Previous single image super-resolution (SISR) methods are trained on synthetic low-resolution images.
We propose a real scene text SR dataset, termed TextZoom.
It contains paired real low-resolution and high-resolution images captured by cameras with different focal lengths in the wild.
arXiv Detail & Related papers (2020-05-07T09:18:59Z)
- Text Perceptron: Towards End-to-End Arbitrary-Shaped Text Spotting [49.768327669098674]
We propose an end-to-end trainable text spotting approach named Text Perceptron.
It first employs an efficient segmentation-based text detector that learns the latent text reading order and boundary information.
Then a novel Shape Transform Module (abbr. STM) is designed to transform the detected feature regions into regular morphologies.
arXiv Detail & Related papers (2020-02-17T08:07:19Z)