DeepSeek-OCR: Contexts Optical Compression
- URL: http://arxiv.org/abs/2510.18234v1
- Date: Tue, 21 Oct 2025 02:41:44 GMT
- Title: DeepSeek-OCR: Contexts Optical Compression
- Authors: Haoran Wei, Yaofeng Sun, Yukun Li
- Abstract summary: We present DeepSeek-OCR as an initial investigation into the feasibility of compressing long contexts via optical 2D mapping. DeepSeek-OCR consists of two components: DeepEncoder and DeepSeek3B-MoE-A570M as the decoder. Experiments show that when the number of text tokens is within 10 times that of vision tokens, the model can achieve decoding (OCR) precision of 97%.
- Score: 15.645614449208125
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present DeepSeek-OCR as an initial investigation into the feasibility of compressing long contexts via optical 2D mapping. DeepSeek-OCR consists of two components: DeepEncoder and DeepSeek3B-MoE-A570M as the decoder. Specifically, DeepEncoder serves as the core engine, designed to maintain low activations under high-resolution input while achieving high compression ratios to ensure an optimal and manageable number of vision tokens. Experiments show that when the number of text tokens is within 10 times that of vision tokens (i.e., a compression ratio < 10x), the model can achieve decoding (OCR) precision of 97%. Even at a compression ratio of 20x, the OCR accuracy still remains at about 60%. This shows considerable promise for research areas such as historical long-context compression and memory forgetting mechanisms in LLMs. Beyond this, DeepSeek-OCR also demonstrates high practical value. On OmniDocBench, it surpasses GOT-OCR2.0 (256 tokens/page) using only 100 vision tokens, and outperforms MinerU2.0 (6000+ tokens per page on average) while utilizing fewer than 800 vision tokens. In production, DeepSeek-OCR can generate training data for LLMs/VLMs at a scale of 200k+ pages per day (a single A100-40G). Codes and model weights are publicly accessible at http://github.com/deepseek-ai/DeepSeek-OCR.
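The abstract ties decoding precision to the ratio of text tokens to vision tokens. As a minimal illustration (not the released DeepSeek-OCR code; it simply assumes the ratio is text tokens divided by vision tokens, as the abstract states, and treats the quoted figures as step-wise bands), the sketch below maps a ratio onto the reported precision levels:

```python
# Illustrative sketch only: the thresholds and percentages are the figures quoted
# in the abstract, treated as coarse bands for demonstration purposes.

def compression_ratio(num_text_tokens: int, num_vision_tokens: int) -> float:
    """Compression ratio as used in the abstract: text tokens per vision token."""
    return num_text_tokens / num_vision_tokens

def reported_ocr_precision(ratio: float) -> str:
    """Map a compression ratio onto the precision figures reported in the abstract."""
    if ratio < 10:
        return "~97% decoding (OCR) precision"
    if ratio <= 20:
        return "~60% decoding (OCR) precision"
    return "not reported in the abstract"

if __name__ == "__main__":
    # Example: a page of 900 text tokens rendered into 100 vision tokens -> 9x ratio.
    r = compression_ratio(900, 100)
    print(f"ratio = {r:.1f}x -> {reported_ocr_precision(r)}")
```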
Related papers
- VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning [55.17170420615628]
Long-context reasoning has significantly empowered large language models (LLMs) to tackle complex tasks. We propose VTC-R1, a new efficient reasoning paradigm that integrates vision-text compression into the reasoning process. Our approach significantly improves inference efficiency, achieving a 2.7x speedup in end-to-end latency.
arXiv Detail & Related papers (2026-01-29T18:07:39Z) - Visual Merit or Linguistic Crutch? A Close Look at DeepSeek-OCR [25.00433693229684]
DeepSeek-OCR claims to decode text tokens exceeding ten times the input visual tokens. We employ sentence-level and word-level semantic corruption to isolate the model's intrinsic OCR capabilities from its language priors. We find that traditional pipeline OCR methods exhibit significantly higher robustness to such semantic perturbations than end-to-end methods. (A toy example of word-level corruption is sketched after this list.)
arXiv Detail & Related papers (2026-01-07T09:01:23Z) - Long-LRM++: Preserving Fine Details in Feed-Forward Wide-Coverage Reconstruction [19.234118544637592]
Long-LRM++ is a model that adopts a semi-explicit scene representation combined with a lightweight decoder. Long-LRM++ matches the rendering quality of LaCT on DL3DV while achieving real-time 14 FPS rendering on an A100 GPU. Our design also scales to 64 input views at the $950\times540$ resolution, demonstrating strong generalization to increased input lengths.
arXiv Detail & Related papers (2025-12-11T04:10:21Z) - Context Cascade Compression: Exploring the Upper Limits of Text Compression [3.013064618174921]
We introduce Context Cascade Compression (C3) to explore the upper limits of text compression. At a 20x compression ratio, our model achieves 98% decoding accuracy, compared to approximately 60% for DeepSeek-OCR. This indicates that in the domain of context compression, C3 demonstrates superior performance and feasibility over optical character compression.
arXiv Detail & Related papers (2025-11-19T09:02:56Z) - Glyph: Scaling Context Windows via Visual-Text Compression [91.20717058018745]
Glyph is a framework that renders long texts into images and processes them with vision-language models. Our method achieves 3-4x token compression while maintaining accuracy comparable to leading long-context models. Under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks.
arXiv Detail & Related papers (2025-10-20T17:58:56Z) - VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs [82.72388893596555]
Multimodal Large Language Models (MLLMs) encounter significant computational and memory bottlenecks. Previous token compression techniques are often constrained by rules that risk discarding critical information. We propose VisionSelector, a lightweight plug-and-play framework that reformulates token compression as an end-to-end learnable decision process.
arXiv Detail & Related papers (2025-10-18T17:54:18Z) - R1-Compress: Long Chain-of-Thought Compression via Chunk Compression and Search [61.4807238517108]
Chain-of-Thought (CoT) reasoning enhances large language models (LLMs) by enabling step-by-step problem-solving. CoT's extension to Long-CoT introduces substantial computational overhead due to increased token length. We propose R1-Compress, a two-stage chunk-level compression framework that preserves both local information and coherence.
arXiv Detail & Related papers (2025-05-22T16:06:59Z) - End-to-End Semantic Preservation in Text-Aware Image Compression Systems [42.76781276416154]
We present an end-to-end compression framework that retains text-specific features for Optical Character Recognition (OCR). Experiments show significant improvements in text extraction accuracy at low bitrates, even outperforming OCR on uncompressed images. We extend this study to general-purpose encoders, exploring their capacity to preserve hidden semantics under extreme compression.
arXiv Detail & Related papers (2025-03-25T09:36:13Z) - Vision-centric Token Compression in Large Language Model [51.92055188780033]
Vision Centric Token Compression (Vist) is a slow-fast compression framework that mirrors human reading. On eleven in-context learning benchmarks, Vist achieves the same accuracy with 2.3 times fewer tokens, cutting FLOPs by 16% and memory by 50%.
arXiv Detail & Related papers (2025-02-02T13:10:06Z) - General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model [22.834085739828815]
We propose the General OCR Theory along with an excellent model, namely GOT, to promote the arrival of OCR-2.0.
GOT, with 580M parameters, is a unified, elegant, end-to-end model consisting of a high-compression encoder and a long-context decoder.
As an OCR-2.0 model, GOT can handle a wide variety of "characters" under various OCR tasks.
arXiv Detail & Related papers (2024-09-03T08:41:31Z) - DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model [118.06260386652778]
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE.
Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance while saving 42.5% of training costs.
arXiv Detail & Related papers (2024-05-07T15:56:43Z) - An end-to-end Optical Character Recognition approach for ultra-low-resolution printed text images [0.0]
We present a novel method for performing optical character recognition (OCR) on low-resolution images.
This approach is inspired by our understanding of the human visual system and builds on established neural networks for performing OCR.
We achieve a mean character level accuracy (CLA) of 99.7% and word level accuracy (WLA) of 98.9% across a set of about 1000 pages of 60 dpi text. (A toy CLA computation is sketched after this list.)
arXiv Detail & Related papers (2021-05-10T17:08:06Z)
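The "Visual Merit or Linguistic Crutch?" entry above probes whether OCR accuracy comes from reading pixels or from leaning on language priors by corrupting the semantics of the text to be recognized. Below is a minimal, hypothetical sketch of one such probe, word-level corruption by shuffling the words of each line; the paper's actual corruption protocol may differ.

```python
# Hypothetical illustration of word-level semantic corruption for probing OCR models
# (inspired by the "Visual Merit or Linguistic Crutch?" entry; not its exact protocol).
import random

def corrupt_words(text: str, seed: int = 0) -> str:
    """Shuffle the words of each line so the text is no longer predictable from
    language priors, while the characters to be recognized stay the same."""
    rng = random.Random(seed)
    corrupted_lines = []
    for line in text.splitlines():
        words = line.split()
        rng.shuffle(words)
        corrupted_lines.append(" ".join(words))
    return "\n".join(corrupted_lines)

if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog"
    print(corrupt_words(sample))
```

Rendering the corrupted text to an image and comparing OCR accuracy against the uncorrupted case would then separate visual recognition from linguistic guessing.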
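The last entry in the list reports character-level accuracy (CLA) and word-level accuracy (WLA). As a hedged illustration, CLA is commonly computed as one minus the normalized edit distance between the OCR output and the reference text; the cited paper may use a different exact definition.

```python
# Assumed CLA definition for illustration: 1 - normalized Levenshtein distance.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def character_level_accuracy(prediction: str, reference: str) -> float:
    """CLA = 1 - edit_distance / len(reference), floored at 0."""
    if not reference:
        return 1.0 if not prediction else 0.0
    return max(0.0, 1.0 - edit_distance(prediction, reference) / len(reference))

if __name__ == "__main__":
    # One substituted character out of twelve -> CLA of about 0.92.
    print(character_level_accuracy("DeepSeek-0CR", "DeepSeek-OCR"))
```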