Optical Context Compression Is Just (Bad) Autoencoding
- URL: http://arxiv.org/abs/2512.03643v1
- Date: Wed, 03 Dec 2025 10:27:27 GMT
- Title: Optical Context Compression Is Just (Bad) Autoencoding
- Authors: Ivan Yee Lee, Cheng Yang, Taylor Berg-Kirkpatrick
- Abstract summary: DeepSeek-OCR demonstrates that rendered text can be reconstructed with high fidelity from a small number of vision tokens. We test two assumptions implicit in the optical-compression narrative: that vision-based compression provides unique advantages for text reconstruction from compressed representations, and that DeepSeek-OCR's reconstruction results are evidence that vision-based compression will be useful for language modeling.
- Score: 32.622769616423035
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: DeepSeek-OCR demonstrates that rendered text can be reconstructed with high fidelity from a small number of vision tokens. This finding has sparked excitement about vision-based context compression for language models. But the evaluation stops at reconstruction; whether these representations help language modeling remains untested. We test two assumptions implicit in the optical-compression narrative: that vision-based compression provides unique advantages for text reconstruction from compressed representations, and that DeepSeek-OCR's reconstruction results are evidence that vision-based compression will be useful for language modeling. Comparing their vision encoder against simple alternatives--parameter-free mean pooling and a learned hierarchical encoder--we find that these simple approaches match or surpass vision for reconstruction at matched compression ratios, and outperform it for language modeling--where vision-based compression fails to beat truncation. The excitement around optical context compression outpaces the evidence. Code and checkpoints are available at https://github.com/ivnle/bad-autoencoding
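The abstract's strongest baseline is parameter-free mean pooling over token embeddings. As a rough illustration only (the function name and `ratio` parameter are our own, not from the paper), pooling a sequence down by a fixed compression ratio might look like:

```python
import numpy as np

def mean_pool_compress(embeddings: np.ndarray, ratio: int) -> np.ndarray:
    """Parameter-free context compression: average every `ratio`
    consecutive token embeddings into one compressed vector.

    embeddings: (seq_len, dim) array of token embeddings.
    Returns a (ceil(seq_len / ratio), dim) array.
    """
    seq_len, dim = embeddings.shape
    # Zero-pad the tail so the sequence length divides evenly
    # (a simplification; the padded zeros dilute the last window's mean).
    pad = (-seq_len) % ratio
    if pad:
        embeddings = np.concatenate([embeddings, np.zeros((pad, dim))], axis=0)
    # Reshape into (num_windows, ratio, dim) and average each window.
    return embeddings.reshape(-1, ratio, dim).mean(axis=1)

# 16 token embeddings compressed 4x down to 4 vectors.
tokens = np.arange(16, dtype=float).reshape(16, 1).repeat(8, axis=1)
compressed = mean_pool_compress(tokens, ratio=4)
print(compressed.shape)  # (4, 8)
```

The point of the comparison is that even this zero-parameter operation gives a downstream model a compressed representation to condition on, which makes it a natural lower bound against a full vision encoder at the same compression ratio.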
Related papers
- Global Context Compression with Interleaved Vision-Text Transformation [12.971394377165767]
In this paper, we investigate global context compression, which saves tokens at both the prefilling and inference stages. We propose VIST2, a novel Transformer that interleaves input text chunks alongside their visual encoding. With a 4x compression ratio, the resulting models demonstrate significant superiority over baselines on long writing tasks.
arXiv Detail & Related papers (2026-01-15T13:29:16Z) - Visual Merit or Linguistic Crutch? A Close Look at DeepSeek-OCR [25.00433693229684]
DeepSeek-OCR claims to decode text tokens exceeding ten times the input visual tokens. We employ sentence-level and word-level semantic corruption to isolate the model's intrinsic OCR capabilities from its language priors. We find that traditional pipeline OCR methods exhibit significantly higher robustness to such semantic perturbations than end-to-end methods.
arXiv Detail & Related papers (2026-01-07T09:01:23Z) - Efficient Discriminative Joint Encoders for Large Scale Vision-Language Reranking [8.189266513060621]
Multimodal retrieval still leans on embedding-based models like CLIP for fast vector search over pre-computed image embeddings. Unlike text retrieval, where joint-encoder rerankers are standard, comparable vision-language rerankers are largely absent. We introduce EDJE, an Efficient Discriminative Joint Encoder that precomputes vision tokens offline and compresses them via a lightweight attention-based adapter.
arXiv Detail & Related papers (2025-10-08T09:46:09Z) - Embedding Compression Distortion in Video Coding for Machines [67.97469042910855]
Currently, video transmission serves not only the Human Visual System (HVS) for viewing but also machine perception for analysis. We propose a Compression Distortion Embedding (CDRE) framework, which extracts machine-perception-related distortion representation and embeds it into downstream models. Our framework can effectively boost the rate-task performance of existing codecs with minimal overhead in terms of execution time and number of parameters.
arXiv Detail & Related papers (2025-03-27T13:01:53Z) - End-to-End Semantic Preservation in Text-Aware Image Compression Systems [42.76781276416154]
We present an end-to-end compression framework that retains text-specific features for Optical Character Recognition (OCR). Experiments show significant improvements in text extraction accuracy at low bitrates, even outperforming OCR on uncompressed images. We extend this study to general-purpose encoders, exploring their capacity to preserve hidden semantics under extreme compression.
arXiv Detail & Related papers (2025-03-25T09:36:13Z) - Hierarchical Semantic Compression for Consistent Image Semantic Restoration [62.97519327310638]
We propose a novel hierarchical semantic compression (HSC) framework that purely operates within intrinsic semantic spaces from generative models. Experimental results demonstrate that the proposed HSC framework achieves state-of-the-art performance on subjective quality and consistency for human vision.
arXiv Detail & Related papers (2025-02-24T03:20:44Z) - Unicorn: Unified Neural Image Compression with One Number Reconstruction [25.79670851851377]
We propose an innovative paradigm, which we dub Unicorn (Unified Neural Image Compression with One Number Reconstruction). By conceptualizing images as index-image pairs and learning the inherent distribution of pairs in a subtle neural network, Unicorn can reconstruct a visually pleasing image from randomly generated noise with only one index number.
arXiv Detail & Related papers (2024-12-11T08:59:04Z) - Cross Modal Compression: Towards Human-comprehensible Semantic Compression [73.89616626853913]
Cross modal compression is a semantic compression framework for visual data.
We show that our proposed CMC can achieve encouraging reconstructed results with an ultrahigh compression ratio.
arXiv Detail & Related papers (2022-09-06T15:31:11Z) - The Devil Is in the Details: Window-based Attention for Image Compression [58.1577742463617]
Most existing learned image compression models are based on Convolutional Neural Networks (CNNs).
In this paper, we study the effects of multiple kinds of attention mechanisms for local feature learning, then introduce a more straightforward yet effective window-based local attention block. The proposed window-based attention is very flexible and can work as a plug-and-play component to enhance CNN and Transformer models.
arXiv Detail & Related papers (2022-03-16T07:55:49Z) - Implicit Neural Representations for Image Compression [103.78615661013623]
Implicit Neural Representations (INRs) have gained attention as a novel and effective representation for various data types.
We propose the first comprehensive compression pipeline based on INRs including quantization, quantization-aware retraining and entropy coding.
We find that our approach to source compression with INRs vastly outperforms similar prior work.
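The INR pipeline summarized above combines quantization with entropy coding. As a hedged sketch of those two stages only (the function names are ours, and this uses uniform quantization plus an entropy estimate rather than the paper's quantization-aware retraining or an actual entropy coder):

```python
import numpy as np

def quantize_uniform(weights: np.ndarray, bits: int):
    """Uniformly quantize network weights to 2**bits levels, a simplified
    stand-in for the quantization stage of an INR compression pipeline."""
    lo, hi = weights.min(), weights.max()
    step = (hi - lo) / (2 ** bits - 1)
    indices = np.round((weights - lo) / step).astype(int)
    return indices, lo, step

def empirical_entropy_bits(indices: np.ndarray) -> float:
    """Shannon entropy of the quantized symbol stream: a lower bound on
    the average bits/symbol an ideal entropy coder could reach."""
    _, counts = np.unique(indices, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
w = rng.normal(size=1000)            # pretend these are trained INR weights
idx, lo, step = quantize_uniform(w, bits=4)
rate = empirical_entropy_bits(idx)   # bits/weight after entropy coding, <= 4
recon = lo + idx * step              # dequantized weights
```

Because normally distributed weights concentrate near the center levels, the entropy rate typically comes out well below the nominal 4 bits per weight, which is exactly the headroom an entropy coder exploits after quantization.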
arXiv Detail & Related papers (2021-12-08T13:02:53Z) - Modeling Lost Information in Lossy Image Compression [72.69327382643549]
Lossy image compression is one of the most commonly used operators for digital images.
We propose a novel invertible framework called Invertible Lossy Compression (ILC) to largely mitigate the information loss problem.
arXiv Detail & Related papers (2020-06-22T04:04:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.