E-ARMOR: Edge case Assessment and Review of Multilingual Optical Character Recognition
- URL: http://arxiv.org/abs/2509.03615v1
- Date: Wed, 03 Sep 2025 18:08:41 GMT
- Title: E-ARMOR: Edge case Assessment and Review of Multilingual Optical Character Recognition
- Authors: Aryan Gupta, Anupam Purwar
- Abstract summary: In this work, we introduce Sprinklr-Edge-OCR, a novel OCR system optimized specifically for edge deployment in resource-constrained environments. We present a large-scale comparative evaluation of five state-of-the-art LVLMs (InternVL, Qwen, GOT OCR, LLaMA, MiniCPM) and two traditional OCR systems (Sprinklr-Edge-OCR, SuryaOCR) on a proprietary, doubly hand-annotated dataset of multilingual (54 languages) images. Our findings demonstrate that traditional OCR systems are the most suitable for edge deployment due to their low compute requirements, low latency, and very high affordability.
- Score: 3.186993645370078
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Optical Character Recognition (OCR) in multilingual, noisy, and diverse real-world images remains a significant challenge. With the rise of Large Vision-Language Models (LVLMs), there is growing interest in their ability to generalize and reason beyond fixed OCR pipelines. In this work, we introduce Sprinklr-Edge-OCR, a novel OCR system optimized specifically for edge deployment in resource-constrained environments. We present a large-scale comparative evaluation of five state-of-the-art LVLMs (InternVL, Qwen, GOT OCR, LLaMA, MiniCPM) and two traditional OCR systems (Sprinklr-Edge-OCR, SuryaOCR) on a proprietary, doubly hand-annotated dataset of multilingual (54 languages) images. Our benchmark covers a broad range of metrics including accuracy, semantic consistency, language coverage, computational efficiency (latency, memory, GPU usage), and deployment cost. To better reflect real-world applicability, we also conducted edge case deployment analysis, evaluating model performance in CPU-only environments. Among the results, Qwen achieved the highest precision (0.54), while Sprinklr-Edge-OCR delivered the best overall F1 score (0.46) and outperformed the others in efficiency, processing images 35x faster (0.17 seconds per image on average) and at less than 1/100th of the cost (0.006 USD per 1,000 images) of LVLMs. Our findings demonstrate that the best-suited OCR systems for edge deployment are the traditional ones, even in the era of LLMs, due to their low compute requirements, low latency, and very high affordability.
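The headline figures in the abstract combine in a standard way; a minimal sketch of the F1 and cost arithmetic (the per-system precision/recall split used below is an illustrative assumption, not a value reported in the paper):

```python
# Sketch of the standard metric arithmetic behind the reported numbers.
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def cost_per_1000_images(cost_per_image_usd: float) -> float:
    """Scale a per-image cost to the per-1,000-images figure quoted in the paper."""
    return cost_per_image_usd * 1000

# Illustrative only: an F1 of 0.46 is consistent with precision = recall = 0.46;
# the paper does not report Sprinklr-Edge-OCR's individual precision/recall here.
print(round(f1_score(0.46, 0.46), 2))
# A per-image cost of 0.000006 USD corresponds to 0.006 USD per 1,000 images.
print(cost_per_1000_images(0.000006))
```

This also makes the abstract's comparison concrete: Qwen's higher precision (0.54) does not imply a higher F1, since F1 penalizes whichever of precision and recall is lower.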
Related papers
- SimpleOCR: Rendering Visualized Questions to Teach MLLMs to Read [43.28273039987167]
We introduce the Visualized-Question (VQ) setting, where text queries are rendered directly onto images. Despite possessing strong OCR capabilities, models suffer a performance degradation of up to 12.7% in the VQ setting. We propose SimpleOCR, a plug-and-play training strategy that imposes a structural constraint on the learning process.
arXiv Detail & Related papers (2026-02-25T21:36:30Z) - LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR [0.29410438275861583]
We present LightOnOCR-2-1B, a multilingual vision-language model that converts document images into clean, naturally ordered text without brittle OCR pipelines. Trained on a large-scale, high-quality distillation mix with strong coverage of scans, French documents, and scientific PDFs, LightOnOCR-2 achieves state-of-the-art results on OlmOCR-Bench. We release model checkpoints under Apache 2.0, and publicly release the dataset and the LightOnOCR-bbox-bench evaluation under their respective licenses.
arXiv Detail & Related papers (2026-01-20T18:58:32Z) - Efficient Perceptual Image Super Resolution: AIM 2025 Study and Benchmark [53.56717645904575]
We aim to replicate or improve the perceptual results of Real-ESRGAN while meeting strict efficiency constraints. The proposed solutions were evaluated on a novel dataset consisting of 500 test images of 4K resolution, each degraded using multiple degradation types. The top-performing approach manages to outperform Real-ESRGAN across all benchmark datasets.
arXiv Detail & Related papers (2025-10-14T17:45:22Z) - DianJin-OCR-R1: Enhancing OCR Capabilities via a Reasoning-and-Tool Interleaved Vision-Language Model [9.557159109747372]
Large vision-language models (LVLMs) are prone to hallucinations, generating words that do not exist in input images. We propose DianJin-OCR-R1, a reasoning-and-tool interleaved VLM trained on domain-specific datasets.
arXiv Detail & Related papers (2025-08-18T03:28:57Z) - MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios [66.59827827146262]
We introduce the MME-VideoOCR benchmark, which encompasses a comprehensive range of video OCR application scenarios. The benchmark consists of 1,464 videos with varying resolutions, aspect ratios, and durations, along with 2,000 meticulously curated, manually annotated question-answer pairs. We evaluate 18 state-of-the-art MLLMs on MME-VideoOCR, revealing that even the best-performing model (Gemini-2.5 Pro) achieves an accuracy of only 73.7%.
arXiv Detail & Related papers (2025-05-27T15:27:46Z) - Reasoning-OCR: Can Large Multimodal Models Solve Complex Logical Reasoning Problems from OCR Cues? [73.35232225256968]
Reasoning-OCR challenges LMMs to solve complex reasoning problems based on cues that can be extracted from rich visual text. Our evaluation offers insights into both proprietary and open-source LMMs across different reasoning challenges.
arXiv Detail & Related papers (2025-05-19T06:45:18Z) - MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly [77.43867473323566]
Long-context vision-language models (LCVLMs) are capable of handling hundreds of images with interleaved text tokens in a single forward pass. MMLongBench is the first benchmark covering a diverse set of long-context vision-language tasks.
arXiv Detail & Related papers (2025-05-15T17:52:54Z) - A Lightweight Multi-Module Fusion Approach for Korean Character Recognition [0.0]
SDA-Net is a lightweight and efficient architecture for robust single-character recognition. It achieves state-of-the-art accuracy on challenging OCR benchmarks, with significantly faster inference.
arXiv Detail & Related papers (2025-04-08T07:50:19Z) - Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments [3.5936169218390703]
This paper introduces an open-source benchmark for evaluating Vision-Language Models (VLMs) on Optical Character Recognition (OCR) tasks in dynamic video environments. We present a curated dataset containing 1,477 manually annotated frames spanning diverse domains, including code editors, news broadcasts, YouTube videos, and advertisements.
arXiv Detail & Related papers (2025-02-10T13:20:19Z) - OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning [72.57452266982642]
OCRBench v2 is a large-scale bilingual text-centric benchmark. It covers 31 diverse scenarios, 10,000 human-verified question-answering pairs, and thorough evaluation metrics. We find that most LMMs score below 50 (out of 100) and suffer from five types of limitations.
arXiv Detail & Related papers (2024-12-31T07:32:35Z) - CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy [50.78228433498211]
CC-OCR comprises four OCR-centric tracks: multi-scene text reading, multilingual text reading, document parsing, and key information extraction. It includes 39 subsets with 7,058 fully annotated images, of which 41% are sourced from real applications and released for the first time. We evaluate nine prominent LMMs and reveal both the strengths and weaknesses of these models, particularly in text grounding, multi-orientation text, and repetition hallucination.
arXiv Detail & Related papers (2024-12-03T07:03:25Z) - EfficientOCR: An Extensible, Open-Source Package for Efficiently Digitizing World Knowledge [1.8434042562191815]
EffOCR is a novel open-source optical character recognition (OCR) package.
It meets both the computational and sample efficiency requirements for liberating texts at scale.
EffOCR is cheap and sample efficient to train, as the model only needs to learn characters' visual appearance and not how they are used in sequence to form language.
arXiv Detail & Related papers (2023-10-16T04:20:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.