Context-Independent OCR with Multimodal LLMs: Effects of Image Resolution and Visual Complexity
- URL: http://arxiv.org/abs/2503.23667v1
- Date: Mon, 31 Mar 2025 02:09:19 GMT
- Title: Context-Independent OCR with Multimodal LLMs: Effects of Image Resolution and Visual Complexity
- Authors: Kotaro Inoue,
- Abstract summary: multimodal Large Language Models (LLMs) have attracted significant attention across various industrial fields.<n>In this work, we examine a context-independent OCR task using single-character images with diverse visual complexities.<n>Our findings reveal that multimodal LLMs can match conventional OCR methods at about 300 ppi, yet their performance deteriorates significantly below 150 ppi.
- Score: 1.8130068086063336
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to their high versatility in tasks such as image captioning, document analysis, and automated content generation, multimodal Large Language Models (LLMs) have attracted significant attention across various industrial fields. In particular, they have been shown to surpass specialized models in Optical Character Recognition (OCR). Nevertheless, their performance under different image conditions remains insufficiently investigated, and individual character recognition is not guaranteed due to their reliance on contextual cues. In this work, we examine a context-independent OCR task using single-character images with diverse visual complexities to determine the conditions for accurate recognition. Our findings reveal that multimodal LLMs can match conventional OCR methods at about 300 ppi, yet their performance deteriorates significantly below 150 ppi. Additionally, we observe a very weak correlation between visual complexity and misrecognitions, whereas a conventional OCR-specific model exhibits no correlation. These results suggest that image resolution and visual complexity may play an important role in the reliable application of multimodal LLMs to OCR tasks that require precise character-level accuracy.
Related papers
- CoLLM: A Large Language Model for Composed Image Retrieval [76.29725148964368]
Composed Image Retrieval (CIR) is a complex task that aims to retrieve images based on a multimodal query.<n>We present CoLLM, a one-stop framework that generates triplets on-the-fly from image-caption pairs.<n>We leverage Large Language Models (LLMs) to generate joint embeddings of reference images and modification texts.
arXiv Detail & Related papers (2025-03-25T17:59:50Z) - ImageScope: Unifying Language-Guided Image Retrieval via Large Multimodal Model Collective Reasoning [62.61187785810336]
ImageScope is a training-free, three-stage framework that unifies language-guided image retrieval tasks.
In the first stage, we improve the robustness of the framework by synthesizing search intents across varying levels of semantic granularity.
In the second and third stages, we reflect on retrieval results by verifying predicate propositions locally, and performing pairwise evaluations globally.
arXiv Detail & Related papers (2025-03-13T08:43:24Z) - Ocean-OCR: Towards General OCR Application via a Vision-Language Model [6.70908296002235]
We present textbfOcean-OCR, a 3B MLLM with state-of-the-art performance on various OCR scenarios and comparable understanding ability on general tasks.
We demonstrate the superiority of Ocean-OCR through comprehensive experiments on open-source OCR benchmarks and across various OCR scenarios.
arXiv Detail & Related papers (2025-01-26T15:20:39Z) - CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy [50.78228433498211]
CC-OCR comprises four OCR-centric tracks: multi-scene text reading, multilingual text reading, document parsing, and key information extraction.<n>It includes 39 subsets with 7,058 full annotated images, of which 41% are sourced from real applications, and released for the first time.<n>We evaluate nine prominent LMMs and reveal both the strengths and weaknesses of these models, particularly in text grounding, multi-orientation, and hallucination of repetition.
arXiv Detail & Related papers (2024-12-03T07:03:25Z) - DLoRA-TrOCR: Mixed Text Mode Optical Character Recognition Based On Transformer [12.966765239586994]
Multi- fonts, mixed scenes and complex layouts seriously affect the recognition accuracy of traditional OCR models.
We propose a parameter-efficient mixed text recognition method based on pre-trained OCR Transformer, namely DLoRA-TrOCR.
arXiv Detail & Related papers (2024-04-19T09:28:16Z) - MouSi: Poly-Visual-Expert Vision-Language Models [132.58949014605477]
This paper proposes the use of ensemble experts technique to synergize the capabilities of individual visual encoders.
This technique introduces a fusion network to unify the processing of outputs from different visual experts.
In our implementation, this technique significantly reduces the positional occupancy in models like SAM, from a substantial 4096 to a more efficient and manageable 64 or even down to 1.
arXiv Detail & Related papers (2024-01-30T18:09:11Z) - Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery.
We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z) - OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.