Controlling Vision-Language Models for Multi-Task Image Restoration
- URL: http://arxiv.org/abs/2310.01018v2
- Date: Wed, 28 Feb 2024 14:28:51 GMT
- Title: Controlling Vision-Language Models for Multi-Task Image Restoration
- Authors: Ziwei Luo, Fredrik K. Gustafsson, Zheng Zhao, Jens Sjölund, Thomas B. Schön
- Abstract summary: We present a degradation-aware vision-language model (DA-CLIP) to better transfer pretrained vision-language models to low-level vision tasks.
Our approach advances state-of-the-art performance on both degradation-specific and unified image restoration tasks.
- Score: 6.239038964461397
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language models such as CLIP have shown great impact on diverse
downstream tasks for zero-shot or label-free predictions. However, when it
comes to low-level vision such as image restoration, their performance
deteriorates dramatically due to corrupted inputs. In this paper, we present a
degradation-aware vision-language model (DA-CLIP) to better transfer pretrained
vision-language models to low-level vision tasks as a multi-task framework for
image restoration. More specifically, DA-CLIP trains an additional controller
that adapts the fixed CLIP image encoder to predict high-quality feature
embeddings. By integrating the embedding into an image restoration network via
cross-attention, we are able to pilot the model to learn a high-fidelity image
reconstruction. The controller itself will also output a degradation feature
that matches the real corruptions of the input, yielding a natural classifier
for different degradation types. In addition, we construct a mixed degradation
dataset with synthetic captions for DA-CLIP training. Our approach advances
state-of-the-art performance on both \emph{degradation-specific} and
\emph{unified} image restoration tasks, showing a promising direction of
prompting image restoration with large-scale pretrained vision-language models.
Our code is available at https://github.com/Algolzw/daclip-uir.
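The abstract describes the mechanism only in prose; the snippet below is a minimal, hedged reading of it in PyTorch. The module names (DegradationController, CrossAttentionCondition), dimensions, and head designs are illustrative assumptions rather than the authors' implementation, which lives in the repository linked above.

```python
# Hedged sketch of the DA-CLIP idea as described in the abstract: a lightweight
# controller over a *frozen* CLIP image encoder produces (a) a content embedding
# that conditions a restoration network via cross-attention and (b) a degradation
# embedding usable as a classifier. All names and dimensions are assumptions;
# see https://github.com/Algolzw/daclip-uir for the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DegradationController(nn.Module):
    """Hypothetical adapter heads on top of frozen CLIP image features."""

    def __init__(self, clip_dim: int = 512):
        super().__init__()
        self.content_head = nn.Sequential(
            nn.Linear(clip_dim, clip_dim), nn.GELU(), nn.Linear(clip_dim, clip_dim)
        )
        self.degradation_head = nn.Sequential(
            nn.Linear(clip_dim, clip_dim), nn.GELU(), nn.Linear(clip_dim, clip_dim)
        )

    def forward(self, clip_image_feat: torch.Tensor):
        # clip_image_feat: (B, clip_dim) from the frozen CLIP image encoder.
        content_emb = self.content_head(clip_image_feat)          # "high-quality" embedding
        degradation_emb = self.degradation_head(clip_image_feat)  # matches the corruption type
        return content_emb, degradation_emb


class CrossAttentionCondition(nn.Module):
    """Injects the content embedding into restoration-network features."""

    def __init__(self, feat_dim: int = 256, clip_dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            feat_dim, heads, kdim=clip_dim, vdim=clip_dim, batch_first=True
        )
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, feat: torch.Tensor, content_emb: torch.Tensor):
        # feat: (B, N, feat_dim) tokens inside the restoration network.
        cond = content_emb.unsqueeze(1)           # (B, 1, clip_dim)
        attended, _ = self.attn(self.norm(feat), cond, cond)
        return feat + attended                    # residual conditioning


def classify_degradation(degradation_emb: torch.Tensor, text_embs: torch.Tensor):
    # Zero-shot-style classification: compare the degradation embedding against
    # CLIP text embeddings of degradation names ("noisy", "rainy", ...).
    sims = F.cosine_similarity(
        degradation_emb.unsqueeze(1), text_embs.unsqueeze(0), dim=-1
    )
    return sims.argmax(dim=-1)
```

The residual cross-attention is a common design choice for this kind of conditioning: if the embedding carries little information, the block degrades gracefully to an identity-like mapping and leaves the restoration backbone's features intact.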
Related papers
- QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation [101.28446308930367]
Quantized Language-Image Pretraining (QLIP) combines state-of-the-art reconstruction quality with state-of-the-art zero-shot image understanding.
QLIP trains a binary-spherical-quantization-based autoencoder with reconstruction and language-image alignment objectives.
We demonstrate that QLIP enables a unified mixed-modality auto-regressive model for understanding and generation.
arXiv Detail & Related papers (2025-02-07T18:59:57Z) - Review Learning: Advancing All-in-One Ultra-High-Definition Image Restoration Training Method [7.487270862599671]
We propose a new training paradigm for general image restoration models, which we name bfReview Learning.
This approach begins with sequential training of an image restoration model on several degraded datasets, combined with a review mechanism.
We design a lightweight all-purpose image restoration network that can efficiently reason about degraded images with 4K resolution on a single consumer-grade GPU.
arXiv Detail & Related papers (2024-08-13T08:08:45Z) - Photo-Realistic Image Restoration in the Wild with Controlled Vision-Language Models [14.25759541950917]
This work leverages a capable vision-language model and a synthetic degradation pipeline to learn image restoration in the wild (wild IR)
Our base diffusion model is the image restoration SDE (IR-SDE)
arXiv Detail & Related papers (2024-04-15T12:34:21Z) - InstructIR: High-Quality Image Restoration Following Human Instructions [61.1546287323136]
We present the first approach that uses human-written instructions to guide the image restoration model.
Our method, InstructIR, achieves state-of-the-art results on several restoration tasks.
arXiv Detail & Related papers (2024-01-29T18:53:33Z) - CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement [65.47237619200442]
Contrastive language-image pretraining (CLIP) is a standard method for training vision-language models.
We augment CLIP training with task-specific vision models from model zoos to improve its visual representations.
This simple setup shows substantial improvements of up to 16.3% across different vision tasks.
arXiv Detail & Related papers (2023-10-21T20:20:13Z) - PromptIR: Prompting for All-in-One Blind Image Restoration [64.02374293256001]
We present a prompt-based learning approach, PromptIR, for All-In-One image restoration.
Our method uses prompts to encode degradation-specific information, which is then used to dynamically guide the restoration network.
PromptIR offers a generic and efficient plugin module with few lightweight prompts.
arXiv Detail & Related papers (2023-06-22T17:59:52Z) - Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
In this paper, we present a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z) - Restormer: Efficient Transformer for High-Resolution Image Restoration [118.9617735769827]
Convolutional neural networks (CNNs) perform well at learning generalizable image priors from large-scale data.
Transformers have shown significant performance gains on natural language and high-level vision tasks.
Our model, named Restoration Transformer (Restormer), achieves state-of-the-art results on several image restoration tasks.
arXiv Detail & Related papers (2021-11-18T18:59:10Z)