How Do Large Vision-Language Models See Text in Image? Unveiling the Distinctive Role of OCR Heads
- URL: http://arxiv.org/abs/2505.15865v1
- Date: Wed, 21 May 2025 10:53:41 GMT
- Title: How Do Large Vision-Language Models See Text in Image? Unveiling the Distinctive Role of OCR Heads
- Authors: Ingeol Baek, Hwan Chang, Sunghyun Ryu, Hwanhee Lee
- Abstract summary: We identify the heads responsible for recognizing text from images, which we term the Optical Character Recognition Head (OCR Head). Our findings regarding these heads are as follows: (1) Less Sparse: Unlike previous retrieval heads, a large number of heads are activated to extract textual information from images. We validate our findings in downstream tasks by applying Chain-of-Thought (CoT) to both OCR and conventional retrieval heads and by masking these heads.
- Score: 3.6152232645741025
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite significant advancements in Large Vision Language Models (LVLMs), a gap remains, particularly regarding their interpretability and how they locate and interpret textual information within images. In this paper, we explore various LVLMs to identify the specific heads responsible for recognizing text from images, which we term the Optical Character Recognition Head (OCR Head). Our findings regarding these heads are as follows: (1) Less Sparse: Unlike previous retrieval heads, a large number of heads are activated to extract textual information from images. (2) Qualitatively Distinct: OCR heads possess properties that differ significantly from general retrieval heads, exhibiting low similarity in their characteristics. (3) Statically Activated: The frequency of activation for these heads closely aligns with their OCR scores. We validate our findings in downstream tasks by applying Chain-of-Thought (CoT) to both OCR and conventional retrieval heads and by masking these heads. We also demonstrate that redistributing sink-token values within the OCR heads improves performance. These insights provide a deeper understanding of the internal mechanisms LVLMs employ in processing embedded textual information in images.
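The ablation described in the abstract (masking OCR or retrieval heads and measuring the downstream effect) can be pictured with a small sketch. The code below is not from the paper; it is a minimal, self-contained illustration of zeroing out selected attention heads in a generic multi-head attention layer, and all shapes, function names, and head indices are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): ablating chosen attention heads by
# zeroing their outputs before the output projection, the kind of head-masking
# probe the abstract describes. Shapes, names, and head indices are illustrative.
import torch
import torch.nn.functional as F

def attention_with_head_mask(q, k, v, w_o, heads_to_mask=()):
    """q, k, v: (batch, seq, num_heads, head_dim), already projected per head.
    w_o: (num_heads * head_dim, d_model) output projection.
    heads_to_mask: indices of heads whose contribution is zeroed out.
    """
    b, s, h, d = q.shape
    scores = torch.einsum("bqhd,bkhd->bhqk", q, k) / d ** 0.5
    attn = F.softmax(scores, dim=-1)                       # (b, h, q, k)
    head_out = torch.einsum("bhqk,bkhd->bqhd", attn, v)    # (b, s, h, d)
    if heads_to_mask:
        head_out[:, :, list(heads_to_mask), :] = 0.0       # ablate selected heads
    return head_out.reshape(b, s, h * d) @ w_o             # (b, s, d_model)

# Toy usage: mask two hypothetical "OCR heads" in an 8-head layer.
b, s, h, d, d_model = 1, 16, 8, 64, 512
q, k, v = (torch.randn(b, s, h, d) for _ in range(3))
w_o = torch.randn(h * d, d_model)
out = attention_with_head_mask(q, k, v, w_o, heads_to_mask=(2, 5))
```

In practice one would hook the corresponding projection inside the LVLM's language model rather than re-implementing attention, but the masking logic is the same.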
Related papers
- InstructOCR: Instruction Boosting Scene Text Spotting [10.724187109801251]
InstructOCR is an innovative instruction-based scene text spotting model. Our framework employs both text and image encoders during training and inference. We achieve state-of-the-art results on widely used benchmarks.
arXiv Detail & Related papers (2024-12-20T03:23:26Z)
- CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy [50.78228433498211]
CC-OCR comprises four OCR-centric tracks: multi-scene text reading, multilingual text reading, document parsing, and key information extraction. It includes 39 subsets with 7,058 fully annotated images, of which 41% are sourced from real applications and released for the first time. We evaluate nine prominent LMMs and reveal both the strengths and weaknesses of these models, particularly in text grounding, multi-orientation, and hallucination of repetition.
arXiv Detail & Related papers (2024-12-03T07:03:25Z)
- LLV-FSR: Exploiting Large Language-Vision Prior for Face Super-resolution [67.23699927053191]
We propose a new framework called LLV-FSR, which marries the power of large vision-language models and higher-order visual priors with the challenging task of face super-resolution.
Experimental results demonstrate that our proposed framework significantly improves both the reconstruction quality and perceptual quality, surpassing the SOTA by 0.43dB in terms of PSNR on the MMCelebA-HQ dataset.
arXiv Detail & Related papers (2024-11-14T09:12:18Z)
- See then Tell: Enhancing Key Information Extraction with Vision Grounding [54.061203106565706]
We introduce STNet (See then Tell Net), a novel end-to-end model designed to deliver precise answers with relevant vision grounding.
To enhance the model's seeing capabilities, we collect extensive structured table recognition datasets.
arXiv Detail & Related papers (2024-09-29T06:21:05Z)
- MARS: Paying more attention to visual attributes for text-based person search [6.438244172631555]
This paper presents MARS (Mae-Attribute-Relation-Sensitive), a novel text-based person search (TBPS) architecture.
It enhances current state-of-the-art models by introducing two key components: a Visual Reconstruction Loss and an Attribute Loss.
Experiments on three commonly used datasets, namely CUHK-PEDES, ICFG-PEDES, and RSTPReid, report performance improvements.
arXiv Detail & Related papers (2024-07-05T06:44:43Z)
- Retrieval Head Mechanistically Explains Long-Context Factuality [56.78951509492645]
We show that a special type of attention head, which we dub retrieval heads, is largely responsible for retrieving information.
We show that retrieval heads strongly influence chain-of-thought (CoT) reasoning, where the model needs to frequently refer back to the question and previously generated context.
We believe our insights will foster future research on reducing hallucination, improving reasoning, and compressing the KV cache. (A toy sketch of how such heads can be scored follows this entry.)
arXiv Detail & Related papers (2024-04-24T00:24:03Z)
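As a companion to the retrieval-head entry above, and to the activation-frequency/OCR-score idea in the main abstract, the sketch below shows one way an activation-frequency-style head score might be computed: a head is counted as "firing" at a generation step when its strongest attention weight lands on the ground-truth source token in the context. This is a hedged illustration of the general scoring recipe, not code from either paper; the input format and the hypothetical head_copy_scores helper are assumptions.

```python
# Illustrative sketch (assumption, not released code): score each attention head
# by how often its strongest attention weight lands on the token it is currently
# copying from the context. Inputs are hypothetical.
import torch

def head_copy_scores(attn_maps, source_positions):
    """attn_maps: list over generation steps; each entry has shape
                  (num_layers, num_heads, context_len), holding the attention
                  from the token being generated to the context.
    source_positions: for each step, the context index of the ground-truth token
                  being copied (e.g., a needle token or an OCR character), or
                  None if that step has no ground-truth source.
    Returns a (num_layers, num_heads) tensor of activation frequencies in [0, 1].
    """
    num_layers, num_heads, _ = attn_maps[0].shape
    hits = torch.zeros(num_layers, num_heads)
    total = 0
    for attn, pos in zip(attn_maps, source_positions):
        if pos is None:
            continue
        total += 1
        top = attn.argmax(dim=-1)        # (num_layers, num_heads) argmax positions
        hits += (top == pos).float()     # head "fires" if its argmax hits the source
    return hits / max(total, 1)
```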
- OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z)
- CLIPTER: Looking at the Bigger Picture in Scene Text Recognition [10.561377899703238]
We harness the capabilities of modern vision-language models, such as CLIP, to provide scene-level information to the crop-based recognizer.
We achieve this by fusing a rich representation of the entire image, obtained from the vision-language model, with the recognizer word-level features via a gated cross-attention mechanism. (A generic sketch of this gated fusion follows this entry.)
arXiv Detail & Related papers (2023-01-18T12:16:19Z)
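The CLIPTER entry above describes fusing a scene-level image representation into word-level recognizer features through a gated cross-attention step. The sketch below is an assumed, generic rendition of that idea, not the paper's implementation; module names, dimensions, and the gating formulation are illustrative.

```python
# Illustrative sketch (assumption, not CLIPTER's code): word-level features
# attend to a scene-level representation, and a learned gate controls how much
# of the fused context is added back to each word feature.
import torch
import torch.nn as nn

class GatedCrossAttentionFusion(nn.Module):
    def __init__(self, d_word, d_scene, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=d_word, kdim=d_scene, vdim=d_scene,
            num_heads=n_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * d_word, d_word), nn.Sigmoid())

    def forward(self, word_feats, scene_feats):
        """word_feats: (batch, num_chars, d_word) from the crop-based recognizer.
        scene_feats: (batch, num_patches, d_scene) from a vision-language encoder.
        """
        ctx, _ = self.attn(word_feats, scene_feats, scene_feats)   # cross-attention
        g = self.gate(torch.cat([word_feats, ctx], dim=-1))        # per-feature gate
        return word_feats + g * ctx                                # gated residual fusion

# Toy usage with made-up dimensions.
fusion = GatedCrossAttentionFusion(d_word=256, d_scene=512)
words = torch.randn(2, 25, 256)
scene = torch.randn(2, 49, 512)
fused = fusion(words, scene)    # (2, 25, 256)
```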
- Transparent Human Evaluation for Image Captioning [70.03979566548823]
We develop a rubric-based human evaluation protocol for image captioning models.
We show that human-generated captions show substantially higher quality than machine-generated ones.
We hope that this work will promote a more transparent evaluation protocol for image captioning.
arXiv Detail & Related papers (2021-11-17T07:09:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.