Visual Merit or Linguistic Crutch? A Close Look at DeepSeek-OCR
- URL: http://arxiv.org/abs/2601.03714v2
- Date: Thu, 08 Jan 2026 08:37:59 GMT
- Title: Visual Merit or Linguistic Crutch? A Close Look at DeepSeek-OCR
- Authors: Yunhao Liang, Ruixuan Ying, Bo Li, Hong Li, Kai Yan, Qingwen Li, Min Yang, Okamoto Satoshi, Zhe Cui, Shiwen Ni
- Abstract summary: DeepSeek-OCR claims to decode text tokens exceeding ten times the input visual tokens. We employ sentence-level and word-level semantic corruption to isolate the model's intrinsic OCR capabilities from its language priors. We find that traditional pipeline OCR methods exhibit significantly higher robustness to such semantic perturbations than end-to-end methods.
- Score: 25.00433693229684
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: DeepSeek-OCR utilizes an optical 2D mapping approach to achieve high-ratio vision-text compression, claiming to decode text tokens exceeding ten times the input visual tokens. While this suggests a promising solution for the LLM long-context bottleneck, we investigate a critical question: "Visual merit or linguistic crutch - which drives DeepSeek-OCR's performance?" By employing sentence-level and word-level semantic corruption, we isolate the model's intrinsic OCR capabilities from its language priors. Results demonstrate that without linguistic support, DeepSeek-OCR's performance plummets from approximately 90% to 20%. Comparative benchmarking against 13 baseline models reveals that traditional pipeline OCR methods exhibit significantly higher robustness to such semantic perturbations than end-to-end methods. Furthermore, we find that lower visual token counts correlate with increased reliance on priors, exacerbating hallucination risks. Context stress testing also reveals a total model collapse around 10,000 text tokens, suggesting that current optical compression techniques may paradoxically aggravate the long-context bottleneck. This study empirically defines DeepSeek-OCR's capability boundaries and offers essential insights for future optimizations of the vision-text compression paradigm. We release all data, results and scripts used in this study at https://github.com/dududuck00/DeepSeekOCR.
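The abstract does not spell out the corruption protocol, but a minimal sketch of what sentence-level and word-level semantic corruption might look like is given below, assuming the corrupted string is then rendered to an image and scored against itself with an edit-distance metric; the helper names and the vocabulary source are illustrative assumptions, not the released scripts.

```python
# Minimal sketch of the corruption idea (an assumption about the protocol,
# not the scripts released at the repository linked above).
import random
import re

def word_level_corrupt(text: str, vocab: list, p: float = 1.0, seed: int = 0) -> str:
    """Replace each word with a random vocabulary item with probability p,
    destroying semantics while keeping layout and length roughly similar."""
    rng = random.Random(seed)
    return " ".join(rng.choice(vocab) if rng.random() < p else w
                    for w in text.split())

def sentence_level_corrupt(text: str, seed: int = 0) -> str:
    """Shuffle sentence order so local wording survives but discourse-level
    language priors are broken."""
    rng = random.Random(seed)
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    rng.shuffle(sentences)
    return " ".join(sentences)

# The corrupted string is rendered to an image, fed to the OCR model, and the
# output is scored against the corrupted ground truth (e.g. normalized
# Levenshtein distance), so language priors cannot recover lost accuracy.
```

Scoring against the corrupted text rather than the original is the key point: any accuracy the model retains must come from visual reading, not from reconstructing plausible language.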
Related papers
- VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning [55.17170420615628]
Long-context reasoning has significantly empowered large language models (LLMs) to tackle complex tasks. We propose VTC-R1, a new efficient reasoning paradigm that integrates vision-text compression into the reasoning process. Our approach significantly improves inference efficiency, achieving a 2.7x speedup in end-to-end latency.
arXiv Detail & Related papers (2026-01-29T18:07:39Z)
- Optical Context Compression Is Just (Bad) Autoencoding [32.622769616423035]
DeepSeek-OCR demonstrates that rendered text can be reconstructed with high fidelity from a small number of vision tokens. We test two assumptions implicit in the optical-compression narrative: that vision-based compression provides unique advantages for text reconstruction from compressed representations, and that DeepSeek-OCR's reconstruction results are evidence that vision-based compression will be useful for language modeling.
arXiv Detail & Related papers (2025-12-03T10:27:27Z)
- DeepSeek-OCR: Contexts Optical Compression [15.645614449208125]
We present DeepSeek-OCR as an initial investigation into the feasibility of compressing long contexts via optical 2D mapping. DeepSeek-OCR consists of two components: DeepEncoder and DeepSeek3B-MoE-A570M as the decoder. Experiments show that when the number of text tokens is within 10 times that of vision tokens, the model can achieve decoding (OCR) precision of 97%.
arXiv Detail & Related papers (2025-10-21T02:41:44Z)
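The 10x figure above reduces to a simple operating-regime check, sketched below; the token counts in the example are made up for illustration, and the 97%-at-10x threshold is taken directly from the entry above.

```python
# Illustrative check of the reported operating regime: decoding precision of
# ~97% is claimed while text tokens stay within ~10x the vision tokens.
def compression_ratio(num_text_tokens: int, num_vision_tokens: int) -> float:
    """Ratio of decoded text tokens to input vision tokens."""
    return num_text_tokens / num_vision_tokens

# Hypothetical page: 2,500 text tokens rendered into 256 vision tokens.
ratio = compression_ratio(2500, 256)
print(f"{ratio:.1f}x compression,", "inside" if ratio <= 10 else "outside",
      "the reported high-precision regime")
```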
- Glyph: Scaling Context Windows via Visual-Text Compression [91.20717058018745]
Glyph is a framework that renders long texts into images and processes them with vision-language models. Our method achieves 3-4x token compression while maintaining accuracy comparable to leading long-context models. Under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks.
arXiv Detail & Related papers (2025-10-20T17:58:56Z)
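Rendering text into page images is straightforward to prototype; the sketch below uses Pillow to rasterize a passage the way a visual-text compression pipeline might, though the page size, font handling, and patch arithmetic are assumptions for illustration rather than Glyph's actual configuration.

```python
# Rough illustration of rendering text to a page image for a vision encoder
# (generic sketch; not Glyph's actual rendering configuration).
import textwrap
from PIL import Image, ImageDraw, ImageFont

def render_text_page(text: str, size: int = 1024, font_size: int = 18,
                     margin: int = 32) -> Image.Image:
    """Rasterize text onto a square white page, wrapping lines to fit."""
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()  # a real pipeline would load a scalable TTF
    chars_per_line = max(1, (size - 2 * margin) // (font_size // 2))
    y = margin
    for line in textwrap.wrap(text, width=chars_per_line):
        draw.text((margin, y), line, fill="black", font=font)
        y += font_size + 4
        if y > size - margin:
            break  # overflow would continue on the next page image
    return img

# A 1024x1024 page cut into 16x16 patches gives 4,096 raw patches; with patch
# merging or a resampler the vision-token count falls well below the text-token
# count, which is where a 3-4x compression ratio can come from.
```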
- Vision-Free Retrieval: Rethinking Multimodal Search with Textual Scene Descriptions [81.33113485830711]
We introduce a vision-free, single-encoder retrieval pipeline for vision-language models. We migrate to a text-to-text paradigm with the assistance of VLLM-generated structured image descriptions. Our approach achieves state-of-the-art zero-shot performance on multiple retrieval and compositionality benchmarks.
arXiv Detail & Related papers (2025-09-23T16:22:27Z)
- VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models [57.2662376527586]
VScan is a two-stage visual token reduction framework. It addresses token redundancy by: (1) integrating complementary global and local scans with token merging during visual encoding, and (2) introducing pruning at intermediate layers of the language model. VScan achieves a 2.91x speedup in prefilling and a 10x reduction in FLOPs, while retaining 95.4% of the original performance.
arXiv Detail & Related papers (2025-05-28T17:59:08Z)
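The snippet above does not specify VScan's merging rule, so the sketch below shows a generic similarity-based token-merging step (averaging the most redundant adjacent pairs), which is one common way "token merging during visual encoding" is realized; it should be read as an illustration of the idea, not as VScan's algorithm.

```python
# Generic similarity-based visual token merging (illustration only; not the
# merging rule used by VScan itself).
import numpy as np

def merge_most_similar(tokens: np.ndarray, num_merges: int) -> np.ndarray:
    """Repeatedly average the pair of adjacent tokens with the highest cosine
    similarity, reducing an (N, D) token matrix to (N - num_merges, D)."""
    toks = tokens.astype(np.float64).copy()
    for _ in range(num_merges):
        normed = toks / (np.linalg.norm(toks, axis=1, keepdims=True) + 1e-8)
        sims = (normed[:-1] * normed[1:]).sum(axis=1)   # adjacent-pair cosine
        i = int(np.argmax(sims))                        # most redundant pair
        merged = (toks[i] + toks[i + 1]) / 2.0
        toks = np.vstack([toks[:i], merged[None, :], toks[i + 2:]])
    return toks

# e.g. 576 patch tokens reduced by 400 merges -> 176 tokens passed to the
# language model, trading a small accuracy drop for faster prefilling.
```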
- End-to-End Semantic Preservation in Text-Aware Image Compression Systems [42.76781276416154]
We present an end-to-end compression framework that retains text-specific features for Optical Character Recognition (OCR). Experiments show significant improvements in text extraction accuracy at low bitrates, even outperforming OCR on uncompressed images. We extend this study to general-purpose encoders, exploring their capacity to preserve hidden semantics under extreme compression.
arXiv Detail & Related papers (2025-03-25T09:36:13Z)
- Vision-centric Token Compression in Large Language Model [51.92055188780033]
Vision Centric Token Compression (Vist) is a slow-fast compression framework that mirrors human reading. On eleven in-context learning benchmarks, Vist achieves the same accuracy with 2.3 times fewer tokens, cutting FLOPs by 16% and memory by 50%.
arXiv Detail & Related papers (2025-02-02T13:10:06Z)
- Improving Vision Anomaly Detection with the Guidance of Language Modality [64.53005837237754]
This paper tackles the challenges of the vision modality from a multimodal point of view. We propose Cross-modal Guidance (CMG) to tackle the redundant information issue and the sparse space issue. To learn a more compact latent space for the vision anomaly detector, CMLE learns a correlation structure matrix from the language modality.
arXiv Detail & Related papers (2023-10-04T13:44:56Z)
- To show or not to show: Redacting sensitive text from videos of electronic displays [4.621328863799446]
We define an approach for redacting personally identifiable text from videos using a combination of optical character recognition (OCR) and natural language processing (NLP) techniques.
We examine the relative performance of this approach when used with different OCR models, specifically Tesseract and the OCR system from Google Cloud Vision (GCV).
arXiv Detail & Related papers (2022-08-19T07:53:04Z)
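The OCR-plus-NLP redaction pipeline above is easy to sketch for a single frame: OCR produces word boxes, a pattern or NER step flags sensitive strings, and flagged boxes are masked. The sketch below uses pytesseract and OpenCV with a simple regex standing in for the NLP component; it illustrates the approach rather than reproducing the paper's code.

```python
# Minimal OCR + pattern-based redaction for a single frame (illustration of
# the pipeline described above, with a regex standing in for the NLP step).
import re
import cv2
import pytesseract
from pytesseract import Output

SENSITIVE = re.compile(
    r"[\w.+-]+@[\w-]+\.[\w.]+"        # email addresses
    r"|\b\d{3}-\d{2}-\d{4}\b"         # SSN-like number patterns
)

def redact_frame(frame):
    """Black out OCR'd words that match the sensitive-text patterns."""
    data = pytesseract.image_to_data(frame, output_type=Output.DICT)
    for i, word in enumerate(data["text"]):
        if word and SENSITIVE.search(word):
            x, y, w, h = (data[k][i] for k in ("left", "top", "width", "height"))
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 0, 0), thickness=-1)
    return frame

# For video, the same function is applied per frame (or per keyframe, with
# boxes propagated by tracking) before re-encoding the redacted stream.
```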