Vision-Language Model Guided Image Restoration
- URL: http://arxiv.org/abs/2512.17292v1
- Date: Fri, 19 Dec 2025 07:16:07 GMT
- Title: Vision-Language Model Guided Image Restoration
- Authors: Cuixin Yang, Rongkang Dong, Kin-Man Lam
- Abstract summary: Vision-language models (VLMs), which excel at aligning visual and textual features, have recently been incorporated into universal image restoration. We propose the Vision-Language Model Guided Image Restoration (VLMIR) framework to enhance IR performance through improved visual perception and semantic understanding. Our approach consists of two stages: VLM-based feature extraction and diffusion-based image restoration.
- Score: 16.151927651999948
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many image restoration (IR) tasks require both pixel-level fidelity and high-level semantic understanding to recover realistic photos with fine-grained details. However, previous approaches often struggle to effectively leverage both the visual and linguistic knowledge. Recent efforts have attempted to incorporate Vision-language models (VLMs), which excel at aligning visual and textual features, into universal IR. Nevertheless, these methods fail to utilize the linguistic priors to ensure semantic coherence during the restoration process. To address this issue, in this paper, we propose the Vision-Language Model Guided Image Restoration (VLMIR) framework, which leverages the rich vision-language priors of VLMs, such as CLIP, to enhance IR performance through improved visual perception and semantic understanding. Our approach consists of two stages: VLM-based feature extraction and diffusion-based image restoration. In the first stage, we extract complementary visual and linguistic representations of input images by condensing the visual perception and high-level semantic priors through VLMs. Specifically, we align the embeddings of captions from low-quality and high-quality images using a cosine similarity loss with LoRA fine-tuning, and employ a degradation predictor to decompose degradation and clean image content embeddings. These complementary visual and textual embeddings are then integrated into a diffusion-based model via cross-attention mechanisms for enhanced restoration. Extensive experiments and ablation studies demonstrate that VLMIR achieves superior performance across both universal and degradation-specific IR tasks, underscoring the critical role of integrated visual and linguistic knowledge from VLMs in advancing image restoration capabilities.
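To make the first-stage alignment concrete, below is a minimal PyTorch sketch of the caption-embedding alignment loss and the degradation/content decomposition described in the abstract. It is an illustrative reconstruction rather than the authors' implementation: the class names (LoRALinear, DegradationPredictor), the 512-dimensional embeddings, the LoRA rank, and the choice to freeze the high-quality caption branch are assumptions not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRALinear(nn.Module):
    """Frozen linear layer augmented with a low-rank trainable update (LoRA)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep the pre-trained weight frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # start as an identity update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


def caption_alignment_loss(lq_text_emb: torch.Tensor,
                           hq_text_emb: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity loss pulling the embeddings of low-quality-image
    captions toward those of the corresponding high-quality-image captions.
    Treating the HQ branch as a frozen target (detached) is an assumption."""
    lq = F.normalize(lq_text_emb, dim=-1)
    hq = F.normalize(hq_text_emb.detach(), dim=-1)
    return (1.0 - (lq * hq).sum(dim=-1)).mean()


class DegradationPredictor(nn.Module):
    """Splits an image embedding into a degradation code and a clean-content code."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.to_degradation = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.to_content = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, image_emb: torch.Tensor):
        return self.to_degradation(image_emb), self.to_content(image_emb)


if __name__ == "__main__":
    dim = 512
    # Stand-in for a CLIP-style text projection wrapped with LoRA (hypothetical).
    text_proj = LoRALinear(nn.Linear(dim, dim), rank=8)
    lq_emb = text_proj(torch.randn(4, dim))   # embeddings of LQ-image captions
    hq_emb = torch.randn(4, dim)              # embeddings of HQ-image captions
    loss = caption_alignment_loss(lq_emb, hq_emb)
    deg, content = DegradationPredictor(dim)(torch.randn(4, dim))
    print(loss.item(), deg.shape, content.shape)
```

In the full framework, the resulting degradation, content, and caption embeddings would then condition the diffusion restoration model through cross-attention; that second stage is omitted from this sketch.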
Related papers
- Attention Guided Alignment in Efficient Vision-Language Models [56.20286899428444]
Large Vision-Language Models (VLMs) rely on effective multimodal alignment between pre-trained vision encoders and Large Language Models (LLMs). This paper presents a comprehensive analysis of attention patterns in efficient VLMs. We introduce Attention-Guided Efficient Vision-Language Models (AGE-VLM), a novel framework that enhances visual grounding through interleaved cross-attention layers.
arXiv Detail & Related papers (2025-11-21T21:36:48Z)
- Reading Images Like Texts: Sequential Image Understanding in Vision-Language Models [9.24989979549793]
Vision-Language Models (VLMs) have demonstrated remarkable performance across a variety of real-world tasks. These models typically process visual information by serializing images, a method that diverges significantly from the parallel nature of human vision. We introduce an instruction-agnostic token compression algorithm based on a plug-and-play visual decoder to improve decoding efficiency.
arXiv Detail & Related papers (2025-09-23T16:07:18Z)
- ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs [98.27348724529257]
We introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions. Models trained with the ViCrit task exhibit substantial gains across a variety of vision-language benchmarks.
arXiv Detail & Related papers (2025-06-11T19:16:54Z)
- Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better [44.15671594378141]
We introduce Autoregressive Semantic Visual Reconstruction (ASVR), which enables joint learning of visual and textual modalities within a unified autoregressive framework. ASVR improves LLaVA-1.5 by 5% in average scores across 14 multimodal benchmarks.
arXiv Detail & Related papers (2025-06-10T17:57:50Z)
- Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guided Visual Selection [53.558449071113245]
Vision-Language Models (VLMs) leverage aligned visual encoders to transform images into visual tokens, allowing them to be processed similarly to text by the backbone large language model (LLM). Recent advancements in vision-language modeling introduce image cropping techniques that feed all encoded sub-images into the model. We propose a lightweight, universal framework that seamlessly integrates with existing VLMs to enhance their ability to process fine-grained details.
arXiv Detail & Related papers (2025-03-14T18:33:31Z)
- Barking Up The Syntactic Tree: Enhancing VLM Training with Syntactic Losses [31.85977999591524]
Vision-Language Models implicitly learn to associate image regions with words from large-scale training data. Rich semantic and syntactic structures within the text modality have been overlooked as sources of supervision. Hierarchically STructured Learning (HIST) enhances spatial vision-language alignment without using additional human annotations.
arXiv Detail & Related papers (2024-12-11T05:36:18Z)
- Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training [30.071860810401933]
This paper advances contrastive language-image pre-training (CLIP) into a novel holistic paradigm. We use image-to-text captioning to generate multiple texts for each image, covering multiple perspectives, granularities, and hierarchies. Our holistic CLIP significantly outperforms existing CLIP on tasks including image-text retrieval, open-vocabulary classification, and dense visual tasks.
arXiv Detail & Related papers (2024-11-30T11:27:58Z)
- LLV-FSR: Exploiting Large Language-Vision Prior for Face Super-resolution [67.23699927053191]
We propose a new framework called LLV-FSR, which marries the power of a large vision-language model and higher-order visual priors with the challenging task of face super-resolution.
Experimental results demonstrate that our proposed framework significantly improves both reconstruction quality and perceptual quality, surpassing the SOTA by 0.43 dB in PSNR on the MMCelebA-HQ dataset.
arXiv Detail & Related papers (2024-11-14T09:12:18Z)
- SSP-IR: Semantic and Structure Priors for Diffusion-based Realistic Image Restoration [20.873676111265656]
SSP-IR aims to fully exploit semantic and structure priors from low-quality images. Our method outperforms other state-of-the-art methods overall on both synthetic and real-world datasets.
arXiv Detail & Related papers (2024-07-04T04:55:14Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- SPIRE: Semantic Prompt-Driven Image Restoration [66.26165625929747]
We develop SPIRE, a Semantic and restoration Prompt-driven Image Restoration framework.
Our approach is the first framework that supports fine-level instruction through language-based quantitative specification of the restoration strength.
Our experiments demonstrate the superior restoration performance of SPIRE compared to the state of the art.
arXiv Detail & Related papers (2023-12-18T17:02:30Z)
- VLMAE: Vision-Language Masked Autoencoder [21.97700040013084]
We propose a vision-language masked autoencoder framework (VLMAE) for vision-language pre-training.
VLMAE employs visual generative learning, facilitating the model to acquire fine-grained and unbiased features.
arXiv Detail & Related papers (2022-08-19T14:39:18Z)