SimpleOCR: Rendering Visualized Questions to Teach MLLMs to Read
- URL: http://arxiv.org/abs/2602.22426v1
- Date: Wed, 25 Feb 2026 21:36:30 GMT
- Title: SimpleOCR: Rendering Visualized Questions to Teach MLLMs to Read
- Authors: Yibo Peng, Peng Xia, Ding Zhong, Kaide Zeng, Siwei Han, Yiyang Zhou, Jiaqi Liu, Ruiyi Zhang, Huaxiu Yao
- Abstract summary: We introduce the Visualized-Question (VQ) setting, where text queries are rendered directly onto images. Despite possessing strong OCR capabilities, models suffer a performance degradation of up to 12.7% in the VQ setting. We propose SimpleOCR, a plug-and-play training strategy that imposes a structural constraint on the learning process.
- Score: 43.28273039987167
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the rapid advancements in Multimodal Large Language Models (MLLMs), a critical question regarding their visual grounding mechanism remains unanswered: do these models genuinely "read" text embedded in images, or do they merely rely on parametric shortcuts in the text prompt? In this work, we diagnose this issue by introducing the Visualized-Question (VQ) setting, where text queries are rendered directly onto images to structurally mandate visual engagement. Our diagnostic experiments on Qwen2.5-VL reveal a startling capability-utilization gap: despite possessing strong OCR capabilities, models suffer a performance degradation of up to 12.7% in the VQ setting, exposing a deep-seated "modality laziness." To bridge this gap, we propose SimpleOCR, a plug-and-play training strategy that imposes a structural constraint on the learning process. By transforming training samples into the VQ format with randomized styles, SimpleOCR effectively invalidates text-based shortcuts, compelling the model to activate and optimize its visual text extraction pathways. Empirically, SimpleOCR yields robust gains without architectural modifications. On four representative OOD benchmarks, it surpasses the base model by 5.4% and GRPO based on original images by 2.7%, while exhibiting extreme data efficiency, achieving superior performance with 30x fewer samples (8.5K) than recent RL-based methods. Furthermore, its plug-and-play nature allows seamless integration with advanced RL strategies like NoisyRollout to yield complementary improvements. Code is available at https://github.com/aiming-lab/SimpleOCR.
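The paper itself does not include code in this abstract, but the VQ transformation it describes is concrete enough to sketch. Below is a minimal, hypothetical Python/Pillow implementation of rendering a text query onto an image with randomized style; the font file, style ranges, and banner layout are illustrative assumptions, not the authors' actual pipeline:

```python
# Hypothetical sketch of the Visualized-Question (VQ) transformation:
# render the text query onto the image with randomized style so the
# model cannot answer from the text prompt alone. Font path, style
# ranges, and layout are assumptions, not the paper's released code.
import random
from PIL import Image, ImageDraw, ImageFont

def render_visualized_question(image: Image.Image, question: str) -> Image.Image:
    """Return a copy of `image` extended with the question rendered as pixels."""
    # Randomize style so the model cannot overfit to a single rendering.
    font_size = random.randint(18, 32)
    try:
        font = ImageFont.truetype("DejaVuSans.ttf", font_size)  # assumed font file
    except OSError:
        font = ImageFont.load_default()
    fg = tuple(random.randint(0, 80) for _ in range(3))      # dark text color
    bg = tuple(random.randint(200, 255) for _ in range(3))   # light banner color

    # Rough character-based wrap of the question to the image width.
    chars_per_line = max(10, image.width // (font_size // 2))
    lines = [question[i:i + chars_per_line]
             for i in range(0, len(question), chars_per_line)]
    banner_h = (font_size + 6) * len(lines) + 12

    # Paste the original image below a banner carrying the rendered question.
    canvas = Image.new("RGB", (image.width, image.height + banner_h), bg)
    canvas.paste(image, (0, banner_h))
    draw = ImageDraw.Draw(canvas)
    for i, line in enumerate(lines):
        draw.text((8, 6 + i * (font_size + 6)), line, fill=fg, font=font)
    return canvas
```

Under this reading of the abstract, the accompanying text prompt would be reduced to a generic instruction (e.g. "Answer the question shown in the image"), so the only route to the query runs through the visual pathway, which is what invalidates the text-based shortcut.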
Related papers
- Where Vision Becomes Text: Locating the OCR Routing Bottleneck in Vision-Language Models [2.1942030377331245]
Vision-language models (VLMs) can read text from images, but where does this optical character recognition (OCR) information enter the language processing stream? We investigate the OCR routing mechanism across three architecture families using causal interventions.
arXiv Detail & Related papers (2026-02-26T12:06:02Z)
- LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR [0.29410438275861583]
We present LightOnOCR-2-1B, a multilingual vision-language model that converts document images into clean, naturally ordered text without brittle OCR pipelines. Trained on a large-scale, high-quality distillation mix with strong coverage of scans, French documents, and scientific PDFs, LightOnOCR-2 achieves state-of-the-art results on OlmOCR-Bench. We release model checkpoints under Apache 2.0, and publicly release the dataset and LightOnOCR-bbox-bench evaluation under their respective licenses.
arXiv Detail & Related papers (2026-01-20T18:58:32Z)
- Aligning Text, Code, and Vision: A Multi-Objective Reinforcement Learning Framework for Text-to-Visualization [50.13408999553116]
We propose RL-Text2Vis, the first reinforcement learning framework for Text2Vis generation. Our method uses a novel multi-objective reward that jointly optimizes textual accuracy, code validity, and visualization quality (a toy sketch of such a combined reward appears after this list). Our results establish GRPO as an effective strategy for structured, multimodal reasoning in visualization generation.
arXiv Detail & Related papers (2026-01-08T04:29:07Z)
- ViSS-R1: Self-Supervised Reinforcement Video Reasoning [84.1180294023835]
We introduce a novel self-supervised reinforcement learning GRPO algorithm (Pretext-GRPO) within the standard R1 pipeline. We also propose the ViSS-R1 framework, which streamlines and integrates pretext-task-based self-supervised learning directly into the MLLM's R1 post-training paradigm.
arXiv Detail & Related papers (2025-11-17T07:00:42Z)
- DianJin-OCR-R1: Enhancing OCR Capabilities via a Reasoning-and-Tool Interleaved Vision-Language Model [9.557159109747372]
Large vision-language models (LVLMs) are prone to hallucinations, generating words that do not exist in input images. We propose DianJin-OCR-R1, a reasoning-and-tool interleaved VLM trained on domain-specific datasets.
arXiv Detail & Related papers (2025-08-18T03:28:57Z)
- VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning [95.89543460132413]
Vision-language models (VLMs) have improved performance by increasing the number of visual tokens. However, most real-world scenarios do not require such an extensive number of visual tokens. We present a new paradigm for visual token compression, namely, VisionThink.
arXiv Detail & Related papers (2025-07-17T17:59:55Z)
- Text-Conditioned Resampler For Long Form Video Understanding [94.81955667020867]
We present a text-conditioned video resampler (TCR) module that uses a pre-trained visual encoder and large language model (LLM).
TCR can process more than 100 frames at a time with plain attention and without optimised implementations.
arXiv Detail & Related papers (2023-12-19T06:42:47Z)
- Structured Multimodal Attentions for TextVQA [57.71060302874151]
We propose an end-to-end structured multimodal attention (SMA) neural network, mainly to solve the first two issues above.
SMA first uses a structural graph representation to encode the object-object, object-text and text-text relationships appearing in the image, and then designs a multimodal graph attention network to reason over it.
Our proposed model outperforms the SoTA models on the TextVQA dataset and on two tasks of the ST-VQA dataset, among all models except the pre-training-based TAP.
arXiv Detail & Related papers (2020-06-01T07:07:36Z)
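As noted in the RL-Text2Vis entry above, its abstract describes a reward that jointly scores textual accuracy, code validity, and visualization quality. A minimal sketch of how such a multi-objective reward could be scalarized for GRPO-style training follows; the component scorers and weights are purely illustrative assumptions, not the paper's actual design:

```python
# Hypothetical sketch of a multi-objective reward in the spirit of the
# RL-Text2Vis abstract. Scorers and weights are assumptions; the paper's
# actual reward design may differ.
from dataclasses import dataclass

@dataclass
class RewardWeights:
    text: float = 0.4   # weight on textual accuracy vs. the reference answer
    code: float = 0.3   # weight on whether the generated plotting code is valid
    vis: float = 0.3    # weight on a quality score for the rendered chart

def text_accuracy(pred: str, ref: str) -> float:
    """Toy scorer: token-overlap F1 between prediction and reference."""
    p, r = set(pred.lower().split()), set(ref.lower().split())
    if not p or not r:
        return 0.0
    overlap = len(p & r)
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(p), overlap / len(r)
    return 2 * prec * rec / (prec + rec)

def code_validity(code: str) -> float:
    """Toy scorer: 1.0 if the candidate code at least parses, else 0.0."""
    try:
        compile(code, "<candidate>", "exec")
        return 1.0
    except SyntaxError:
        return 0.0

def combined_reward(pred: str, ref: str, code: str, vis_score: float,
                    w: RewardWeights = RewardWeights()) -> float:
    """Scalarize the three objectives; GRPO then normalizes rewards per group."""
    return (w.text * text_accuracy(pred, ref)
            + w.code * code_validity(code)
            + w.vis * vis_score)
```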