VLIC: Vision-Language Models As Perceptual Judges for Human-Aligned Image Compression
- URL: http://arxiv.org/abs/2512.15701v1
- Date: Wed, 17 Dec 2025 18:52:55 GMT
- Title: VLIC: Vision-Language Models As Perceptual Judges for Human-Aligned Image Compression
- Authors: Kyle Sargent, Ruiqi Gao, Philipp Henzler, Charles Herrmann, Aleksander Holynski, Li Fei-Fei, Jiajun Wu, Jason Zhang
- Abstract summary: Vision-Language Models for Image Compression (VLIC) is a diffusion-based image compression system designed to be post-trained with binary VLM judgments. We show that calibrating this system on VLM judgments produces competitive or state-of-the-art performance on human-aligned visual compression depending on the dataset.
- Score: 83.36460501519203
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Evaluations of image compression performance which include human preferences have generally found that naive distortion functions such as MSE are insufficiently aligned to human perception. In order to align compression models to human perception, prior work has employed differentiable perceptual losses consisting of neural networks calibrated on large-scale datasets of human psycho-visual judgments. We show that, surprisingly, state-of-the-art vision-language models (VLMs) can replicate binary human two-alternative forced choice (2AFC) judgments zero-shot when asked to reason about the differences between pairs of images. Motivated to exploit the powerful zero-shot visual reasoning capabilities of VLMs, we propose Vision-Language Models for Image Compression (VLIC), a diffusion-based image compression system designed to be post-trained with binary VLM judgments. VLIC leverages existing techniques for diffusion model post-training with preferences, rather than distilling the VLM judgments into a separate perceptual loss network. We show that calibrating this system on VLM judgments produces competitive or state-of-the-art performance on human-aligned visual compression depending on the dataset, according to perceptual metrics and large-scale user studies. We additionally conduct an extensive analysis of the VLM-based reward design and training procedure and share important insights. More visuals are available at https://kylesargent.github.io/vlic
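The zero-shot 2AFC protocol the abstract describes can be approximated in a few lines. The sketch below is our own illustration, not the authors' implementation: `query_vlm` is a hypothetical stand-in for any VLM chat API that accepts interleaved text and images, and the prompt wording and order randomization are assumptions (order randomization is a common guard against position bias, not necessarily the paper's protocol).

```python
import random

def query_vlm(prompt: str, images: list) -> str:
    """Hypothetical stand-in for a VLM chat API that accepts
    interleaved text and images and returns a text answer."""
    raise NotImplementedError("wire up your VLM provider here")

def two_afc_judgment(reference, recon_a, recon_b, rng=random) -> str:
    """Ask a VLM which reconstruction is perceptually closer to the
    reference, mimicking a binary 2AFC human study.

    Returns "A" or "B" in the caller's original ordering."""
    # Randomize presentation order to guard against position bias.
    swapped = rng.random() < 0.5
    first, second = (recon_b, recon_a) if swapped else (recon_a, recon_b)

    prompt = (
        "The first image is the original. The second and third images are "
        "two compressed reconstructions of it. Reason about the visible "
        "differences, then answer with a single word, FIRST or SECOND: "
        "which reconstruction looks closer to the original?"
    )
    answer = query_vlm(prompt, [reference, first, second]).strip().upper()
    picked_first = answer.startswith("FIRST")
    # Map the VLM's answer back to the caller's A/B ordering.
    if picked_first:
        return "B" if swapped else "A"
    return "A" if swapped else "B"
```

A judgment collector of this shape yields exactly the binary preference labels that preference-based diffusion post-training methods consume.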
Related papers
- Benchmarking and Enhancing VLM for Compressed Image Understanding [52.98037879935058]
Vision-Language Models (VLMs) predominantly digest and understand high-bitrate compressed images. Their ability to interpret low-bitrate compressed images remains largely unexplored. We introduce the first comprehensive benchmark to evaluate the ability of VLMs to understand compressed images.
arXiv Detail & Related papers (2025-12-24T02:59:01Z)
- ViCO: A Training Strategy towards Semantic Aware Dynamic High-Resolution [71.69364653858447]
Existing Multimodal Large Language Models (MLLMs) suffer from increased inference costs due to the additional vision tokens introduced by image inputs. We propose Visual Consistency Learning (ViCO), a novel training algorithm that enables the model to represent images of varying complexities using different numbers of vision tokens. Experimental results demonstrate that our method can reduce the number of vision tokens by up to 50% while maintaining the model's perception, reasoning, and OCR capabilities.
arXiv Detail & Related papers (2025-10-14T17:58:10Z)
- ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs [98.27348724529257]
We introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions. Models trained with the ViCrit task exhibit substantial gains across a variety of vision-language model benchmarks.
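The verifiable-reward structure of such a proxy task is easy to sketch. The snippet below is a schematic illustration of the general idea only; the perturbation strategy, the `swaps` mapping, and the binary reward rule are our assumptions, not ViCrit's exact recipe.

```python
import random

def inject_hallucination(caption: str, swaps: dict, rng=random):
    """Replace one ground-truth word with a plausible but wrong one,
    returning the corrupted caption and the injected word (the target).
    `swaps` maps true words to hallucinated substitutes, e.g.
    {"dog": "cat", "red": "blue"}; this mapping is an assumption."""
    candidates = [w for w in caption.split() if w in swaps]
    if not candidates:
        return caption, None
    target = rng.choice(candidates)
    corrupted = caption.replace(target, swaps[target], 1)
    return corrupted, swaps[target]

def verifiable_reward(model_answer: str, injected_word: str) -> float:
    """Binary, automatically checkable reward: did the model point at
    the injected hallucination?"""
    return float(injected_word is not None
                 and injected_word.lower() in model_answer.lower())
```

Because the corruption is synthetic, the reward can be checked exactly without any human labeling, which is what makes the task usable for RL.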
arXiv Detail & Related papers (2025-06-11T19:16:54Z)
- Guided Diffusion for the Extension of Machine Vision to Human Visual Perception [0.0]
We propose a method for extending machine vision to human visual perception using guided diffusion. Guided diffusion acts as a bridge between machine vision and human perception, enabling transitions between them without any additional overhead.
arXiv Detail & Related papers (2025-03-23T03:04:26Z)
- Flow to the Mode: Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization [28.089274647643716]
FlowMo is a transformer-based diffusion autoencoder that achieves a new state-of-the-art for image tokenization at multiple compression rates. Our key insight is that FlowMo training should be broken into a mode-matching pre-training stage and a mode-seeking post-training stage.
arXiv Detail & Related papers (2025-03-14T03:49:17Z)
- Predicting Satisfied User and Machine Ratio for Compressed Images: A Unified Approach [58.71009078356928]
We create a deep learning-based model to predict the Satisfied User Ratio (SUR) and Satisfied Machine Ratio (SMR) of compressed images simultaneously. Experimental results indicate that the proposed model significantly outperforms state-of-the-art SUR and SMR prediction methods.
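A unified predictor of this kind is typically a shared backbone with two regression heads. The sketch below is a generic illustration of that design under our own assumptions (the architecture, dimensions, and names are not the paper's model).

```python
import torch
import torch.nn as nn

class SURSMRPredictor(nn.Module):
    """Generic shared-backbone, two-head regressor: one head for the
    Satisfied User Ratio (SUR), one for the Satisfied Machine Ratio
    (SMR), both squashed to [0, 1]."""
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.backbone = nn.Sequential(  # stand-in feature extractor
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        self.sur_head = nn.Linear(feat_dim, 1)
        self.smr_head = nn.Linear(feat_dim, 1)

    def forward(self, compressed_img: torch.Tensor):
        h = self.backbone(compressed_img)
        return torch.sigmoid(self.sur_head(h)), torch.sigmoid(self.smr_head(h))
```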
arXiv Detail & Related papers (2024-12-23T11:09:30Z)
- V-DPO: Mitigating Hallucination in Large Vision Language Models via Vision-Guided Direct Preference Optimization [21.248617886995103]
We propose Vision-guided Direct Preference Optimization (V-DPO) to enhance visual context learning at training time.
Our analysis indicates that V-DPO excels in learning from image-contrast preference data, demonstrating its superior ability to elicit and understand nuances of visual context.
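For reference, the standard DPO objective that V-DPO builds on fits in a few lines. This is the generic textbook form of the loss, not the paper's vision-guided variant.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1):
    """Standard Direct Preference Optimization loss.

    logp_w / logp_l: policy log-probs of the preferred / rejected
    response; ref_logp_*: the same under a frozen reference model."""
    # Implicit reward margin between preferred and rejected responses.
    logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(logits).mean()
```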
arXiv Detail & Related papers (2024-11-05T01:24:37Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- Visual Data-Type Understanding does not emerge from Scaling Vision-Language Models [31.69213233651326]
We introduce the novel task of Visual Data-Type Identification.
An extensive zero-shot evaluation of 39 vision-language models (VLMs) shows a nuanced performance landscape.
arXiv Detail & Related papers (2023-10-12T17:59:30Z)
- Unleashing Text-to-Image Diffusion Models for Visual Perception [84.41514649568094]
VPD (Visual Perception with a pre-trained diffusion model) is a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks.
We show that VPD can be quickly adapted to downstream visual perception tasks.
arXiv Detail & Related papers (2023-03-03T18:59:47Z)