Related papers: Benchmarking and Enhancing VLM for Compressed Image Understanding

URL: http://arxiv.org/abs/2512.20901v1
Date: Wed, 24 Dec 2025 02:59:01 GMT
Title: Benchmarking and Enhancing VLM for Compressed Image Understanding
Authors: Zifu Zhang, Tongda Xu, Siqi Li, Shengxi Li, Yue Zhang, Mai Xu, Yan Wang
Abstract summary: Vision-Language Models (VLMs) predominantly digest and understand high-bitrate compressed images. Their ability to interpret low-bitrate compressed images has yet to be explored. We introduce the first comprehensive benchmark to evaluate the ability of VLMs against compressed images.
Score: 52.98037879935058
License: http://creativecommons.org/licenses/by/4.0/
Abstract: With the rapid development of Vision-Language Models (VLMs) and the growing demand for their applications, efficient compression of the image inputs has become increasingly important. Existing VLMs predominantly digest and understand high-bitrate compressed images, while their ability to interpret low-bitrate compressed images has yet to be explored. In this paper, we introduce the first comprehensive benchmark to evaluate the ability of VLMs against compressed images, covering widely used image codecs and a diverse set of tasks, and encompassing over one million compressed images. Next, we analyse the source of the performance gap by decomposing it into a) the information loss during compression and b) the generalisation failure of the VLM. We visualise these gaps with concrete examples and identify that for compressed images, only the generalisation gap can be mitigated. Finally, we propose a universal VLM adaptor to enhance model performance on images compressed by existing codecs. We demonstrate that a single adaptor can improve VLM performance across images with varying codecs and bitrates by 10%-30%. We believe that our benchmark and enhancement method provide valuable insights and contribute toward bridging the gap between VLMs and compressed images.

Related papers

Plug-and-Play Versatile Compressed Video Enhancement [57.62582951699999]
Video compression effectively reduces the size of files, making real-time cloud computing possible. However, it comes at the cost of visual quality, which challenges the robustness of downstream vision models. We present a versatile-aware enhancement framework that adaptively enhances videos under different compression settings.
arXiv Detail & Related papers (2025-04-21T18:39:31Z)
Large Language Model for Lossless Image Compression with Visual Prompts [26.132381529841815]
This paper introduces a novel paradigm for lossless image compression that incorporates Large Language Models with visual prompts. Experiments on multiple benchmark datasets demonstrate our method achieves state-of-the-art compression performance. Our approach can be easily extended to images from other domains, such as medical and screen content images, achieving impressive performance.
arXiv Detail & Related papers (2025-02-22T09:36:03Z)
UniMIC: Towards Universal Multi-modality Perceptual Image Compression [21.370591256689885]
We present UniMIC, a universal multi-modality image compression framework. UniMIC aims to unify the rate-distortion-perception (RDP) optimization for multiple image codecs.
arXiv Detail & Related papers (2024-12-06T10:08:55Z)
Large Language Models for Lossless Image Compression: Next-Pixel Prediction in Language Space is All You Need [53.584140947828004]
A large language model (LLM) with unprecedented intelligence is a general-purpose lossless compressor for various data modalities. We propose P$^2$-LLM, a next-pixel prediction-based LLM, which integrates various elaborated insights and methodologies. Experiments on benchmark datasets demonstrate that P$^2$-LLM can beat SOTA classical and learned codecs.
arXiv Detail & Related papers (2024-11-19T12:15:40Z)
MISC: Ultra-low Bitrate Image Semantic Compression Driven by Large Multimodal Model [78.4051835615796]
This paper proposes a method called Multimodal Image Semantic Compression (MISC). It consists of an LMM encoder to extract the semantic information of the image, a map encoder to locate the region corresponding to the semantics, an image encoder to generate an extremely compressed bitstream, and a decoder to reconstruct the image based on the above information. It can achieve optimal consistency and perception results while saving 50% bitrate, which has strong potential applications in the next generation of storage and communication.
arXiv Detail & Related papers (2024-02-26T17:11:11Z)
Progressive Learning with Visual Prompt Tuning for Variable-Rate Image Compression [60.689646881479064]
We propose a progressive learning paradigm for transformer-based variable-rate image compression. Inspired by visual prompt tuning, we use LPM to extract prompts for input images and hidden features at the encoder side and decoder side, respectively. Our model outperforms all current variable-rate image methods in terms of rate-distortion performance and approaches the state-of-the-art fixed-rate image compression methods trained from scratch.
arXiv Detail & Related papers (2023-11-23T08:29:32Z)
You Can Mask More For Extremely Low-Bitrate Image Compression [80.7692466922499]
Learned image compression (LIC) methods have experienced significant progress in recent years. However, LIC methods fail to explicitly explore the image structure and texture components crucial for image compression. We present DA-Mask, which samples visible patches based on the structure and texture of original images. We propose a simple yet effective masked compression model (MCM), the first framework that unifies LIC and masked image modeling (MIM) end-to-end for extremely low-bitrate compression.
arXiv Detail & Related papers (2023-06-27T15:36:22Z)
Learned Multi-Resolution Variable-Rate Image Compression with Octave-based Residual Blocks [15.308823742699039]
We propose a new variable-rate image compression framework, which employs generalized octave convolutions (GoConv) and generalized octave transposed-convolutions (GoTConv). To enable a single model to operate with different bit rates and to learn multi-rate image features, a new objective function is introduced. Experimental results show that the proposed framework trained with the variable-rate objective function outperforms standard codecs such as H.265/HEVC-based BPG and state-of-the-art learning-based variable-rate methods.
arXiv Detail & Related papers (2020-12-31T06:26:56Z)

This list is automatically generated from the titles and abstracts of the papers on this site.