When MLLMs Meet Compression Distortion: A Coding Paradigm Tailored to MLLMs
- URL: http://arxiv.org/abs/2509.24258v1
- Date: Mon, 29 Sep 2025 04:07:52 GMT
- Title: When MLLMs Meet Compression Distortion: A Coding Paradigm Tailored to MLLMs
- Authors: Jinming Liu, Zhaoyang Jia, Jiahao Li, Bin Li, Xin Jin, Wenjun Zeng, Yan Lu,
- Abstract summary: We propose an image Codec TAilored to MLLMs (CoTAM) designed to adaptively protect multi-level features and suit different demands of downstream tasks. Our method achieves up to 35.99% bitrate saving while maintaining the same performance on MLLM tasks.
- Score: 38.29061845878822
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The increasing deployment of powerful Multimodal Large Language Models (MLLMs), typically hosted on cloud platforms, urgently requires effective compression techniques to efficiently transmit signal inputs (e.g., images, videos) from edge devices with minimal bandwidth usage. However, conventional image codecs are optimized for fidelity to serve the Human Visual System (HVS) and are ill-suited for MLLMs, in which diverse downstream tasks are jointly considered. In this paper, we first systematically analyze the impact of compression artifacts on several mainstream MLLMs. We find that compression distortion unevenly impacts different-level image features, leading to varying effects on MLLMs' downstream tasks depending on their feature-level reliance. Motivated by this discovery, we propose an image Codec TAilored to MLLMs (CoTAM), designed to adaptively protect multi-level features and suit the differing demands of downstream tasks. The encoder leverages CLIP's shallow-layer attention to generate an importance map for bit allocation, preserving critical semantic regions. Concurrently, the decoder integrates a lightweight adapter with a multi-level loss function to ensure faithful reconstruction of both low-level details and high-level semantic context for robust synthesis of cross-level features. Extensive experiments validate that our method achieves up to 35.99% bitrate saving while maintaining the same performance on MLLM tasks, outperforming previous SOTA neural codecs.
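The encoder-side mechanism described in the abstract, an importance map derived from CLIP's shallow-layer attention to steer bit allocation, can be approximated with off-the-shelf tools. The sketch below is an illustrative reconstruction, not the authors' code: the Hugging Face CLIP checkpoint, the choice of layer 2 as "shallow", the CLS-to-patch head averaging, the quantization-step mapping, and the input file name `example.jpg` are all assumptions.

```python
# Minimal sketch: build a patch-level importance map from CLIP shallow-layer
# attention, in the spirit of CoTAM's encoder-side bit allocation.
# Layer choice, normalization, and the importance-to-bitrate mapping below are
# illustrative assumptions, not the paper's implementation.
import torch
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor

model_id = "openai/clip-vit-base-patch16"
model = CLIPVisionModel.from_pretrained(model_id).eval()
processor = CLIPImageProcessor.from_pretrained(model_id)

image = Image.open("example.jpg").convert("RGB")   # hypothetical input image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

shallow_layer = 2                                   # assumed "shallow" layer index
attn = out.attentions[shallow_layer]                # (B, heads, 197, 197) for ViT-B/16
cls_to_patch = attn[:, :, 0, 1:].mean(dim=1)        # CLS-token attention to each patch
side = int(cls_to_patch.shape[-1] ** 0.5)           # 14x14 patch grid
importance = cls_to_patch.reshape(-1, side, side)
importance = importance / importance.amax(dim=(-2, -1), keepdim=True)

# Hypothetical bit allocation: spend more bits on high-importance patches by
# shrinking a base quantization step (smaller step = more bits).
base_qstep = 1.0
qstep_map = base_qstep * (1.5 - importance)         # roughly in [0.5, 1.5], illustrative
print(qstep_map.shape)                              # torch.Size([1, 14, 14])
```

In a neural codec this per-patch map would typically be upsampled to the latent resolution and used to modulate the rate or quantization of each spatial location; the abstract does not specify that mapping, so the one above is only a placeholder.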
Related papers
- Vision Token Reduction via Attention-Driven Self-Compression for Efficient Multimodal Large Language Models [34.12135666939555]
Multimodal Large Language Models (MLLMs) incur significant computational cost from processing numerous vision tokens through all layers. We introduce Attention-Driven Self-Compression (ADSC), a simple, broadly applicable method that progressively reduces vision tokens using only the LLM's attention mechanism. ADSC reduces FLOPs by 53.7% and peak KV-cache memory by 56.7%, while preserving 98.2% of the original model performance. (A hedged sketch of this attention-driven pruning idea appears after the related-papers list.)
arXiv Detail & Related papers (2026-02-13T04:49:27Z) - CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding [24.71096142371054]
Large Language Models (LLMs) have achieved remarkable success in source code understanding. As software systems grow in scale, computational efficiency has become a critical bottleneck.
arXiv Detail & Related papers (2026-02-02T08:10:21Z) - Benchmarking and Enhancing VLM for Compressed Image Understanding [52.98037879935058]
Vision-Language Models (VLMs) predominantly digest and understand high-bitrate compressed images. Their ability to interpret low-bitrate compressed images has so far remained largely unexplored. We introduce the first comprehensive benchmark to evaluate the ability of VLMs on compressed images.
arXiv Detail & Related papers (2025-12-24T02:59:01Z) - A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models [85.30893355216486]
We study how visual token redundancy evolves with different dMLLM architectures and tasks. Our study reveals that visual redundancy emerges only in from-scratch dMLLMs while handling long-answer tasks. Layer-skipping is promising for accelerating AR-to-diffusion dMLLMs, whereas progressive or late-step pruning is more effective for from-scratch dMLLMs.
arXiv Detail & Related papers (2025-11-19T04:13:36Z) - QLIP: A Dynamic Quadtree Vision Prior Enhances MLLM Performance Without Retraining [28.2730962800806]
We propose a drop-in replacement for CLIP vision encoders that can be seamlessly integrated with existing MLLMs. QLIP improves the general visual question answering accuracy of the LLaVA v1.5 model series across various model sizes. Notably, QLIP boosts detailed understanding performance on the challenging $Vast$ benchmark by up to 13.6 percent.
arXiv Detail & Related papers (2025-05-29T02:26:34Z) - FILA: Fine-Grained Vision Language Models [15.128058747088222]
HyViLM is designed to process images of any resolution while retaining the overall context during encoding. Compared with the state-of-the-art MLLMs under the same setting, our HyViLM outperforms existing MLLMs in nine out of ten tasks.
arXiv Detail & Related papers (2024-12-11T13:41:21Z) - Mini-Monkey: Alleviating the Semantic Sawtooth Effect for Lightweight MLLMs via Complementary Image Pyramid [87.09900996643516]
We introduce a Complementary Image Pyramid (CIP) to mitigate semantic discontinuity during high-resolution image processing.
We also introduce a Scale Compression Mechanism (SCM) to reduce the additional computational overhead by compressing the redundant visual tokens.
Our experiments demonstrate that CIP can consistently enhance the performance across diverse architectures.
arXiv Detail & Related papers (2024-08-04T13:55:58Z) - Bridging Compressed Image Latents and Multimodal Large Language Models [45.83457913639876]
This paper presents the first-ever study of adapting compressed image latents to suit the needs of downstream vision tasks. MLLMs have extended the success of large language models to modalities beyond text, but their billion-parameter scale hinders deployment on resource-constrained end devices. We propose a novel framework with a lightweight transform-neck and a surrogate loss to adapt compressed image latents for MLLM-based vision tasks.
arXiv Detail & Related papers (2024-07-29T02:32:44Z) - Dense Connector for MLLMs [89.50595155217108]
We introduce the Dense Connector - a plug-and-play vision-language connector that significantly enhances existing MLLMs.
Building on this, we also propose the Efficient Dense Connector, which achieves performance comparable to LLaVA-v1.5 with only 25% of the visual tokens.
Our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well.
arXiv Detail & Related papers (2024-05-22T16:25:03Z)