Related papers: RAVE: Residual Vector Embedding for CLIP-Guided Backlit Image Enhancement

URL: http://arxiv.org/abs/2404.01889v3
Date: Sat, 20 Jul 2024 22:57:08 GMT
Title: RAVE: Residual Vector Embedding for CLIP-Guided Backlit Image Enhancement
Authors: Tatiana Gaintseva, Martin Benning, Gregory Slabaugh
Abstract summary: We propose a novel modification of Contrastive Language-Image Pre-Training (CLIP) guidance for the task of unsupervised backlit image enhancement. Our work builds on the state-of-the-art CLIP-LIT approach, which learns a prompt pair by constraining the text-image similarity between a prompt (negative/positive sample) and a corresponding image (backlit image/well-lit image) in the CLIP embedding space. We show that instead of tuning prompts in the space of text embeddings, it is possible to directly tune their embeddings in the latent space without any loss in quality.
Score: 0.24578723416255752
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this paper we propose a novel modification of Contrastive Language-Image Pre-Training (CLIP) guidance for the task of unsupervised backlit image enhancement. Our work builds on the state-of-the-art CLIP-LIT approach, which learns a prompt pair by constraining the text-image similarity between a prompt (negative/positive sample) and a corresponding image (backlit image/well-lit image) in the CLIP embedding space. Learned prompts then guide an image enhancement network. Based on the CLIP-LIT framework, we propose two novel methods for CLIP guidance. First, we show that instead of tuning prompts in the space of text embeddings, it is possible to directly tune their embeddings in the latent space without any loss in quality. This accelerates training and potentially enables the use of additional encoders that do not have a text encoder. Second, we propose a novel approach that does not require any prompt tuning. Instead, based on CLIP embeddings of backlit and well-lit images from training data, we compute the residual vector in the embedding space as a simple difference between the mean embeddings of the well-lit and backlit images. This vector then guides the enhancement network during training, pushing a backlit image towards the space of well-lit images. This approach further dramatically reduces training time, stabilizes training and produces high quality enhanced images without artifacts, both in supervised and unsupervised training regimes. Additionally, we show that residual vectors can be interpreted, revealing biases in training data, and thereby enabling potential bias correction.
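
The residual-vector idea in the abstract (the difference between the mean embeddings of well-lit and backlit images) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the random arrays below merely stand in for CLIP image embeddings, and the function names, the 512-dimensional size, and the `alignment_score` projection are hypothetical choices made for illustration. How the vector actually enters the training loss is not specified here.

```python
import numpy as np


def residual_vector(well_lit_emb, backlit_emb):
    """Mean well-lit embedding minus mean backlit embedding.

    Both inputs are arrays of shape (num_images, dim).
    """
    return well_lit_emb.mean(axis=0) - backlit_emb.mean(axis=0)


def alignment_score(image_emb, residual, eps=1e-12):
    """Project embeddings onto the unit-length residual direction.

    Hypothetical helper: a larger value means the embedding lies
    further along the backlit-to-well-lit direction.
    """
    unit = residual / max(np.linalg.norm(residual), eps)
    return image_emb @ unit


# Synthetic stand-ins for CLIP image embeddings (assumption: dim = 512).
rng = np.random.default_rng(0)
dim = 512
shift = rng.normal(size=dim)  # made-up "lighting" direction
backlit = rng.normal(size=(64, dim))
well_lit = rng.normal(size=(64, dim)) + 3.0 * shift

residual = residual_vector(well_lit, backlit)
backlit_score = alignment_score(backlit, residual).mean()
well_lit_score = alignment_score(well_lit, residual).mean()
print(f"mean backlit score:  {backlit_score:.3f}")
print(f"mean well-lit score: {well_lit_score:.3f}")
```

By construction, the mean score of the well-lit set exceeds that of the backlit set by exactly the norm of the residual vector, which is why such a direction could plausibly serve as a guidance signal for an enhancement network.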

Related papers

Implicit Inversion turns CLIP into a Decoder [15.428694454730541]
We show that image synthesis is possible using CLIP alone -- without any decoder, training, or fine-tuning. Our approach optimizes a frequency-aware implicit neural representation that encourages coarse-to-fine generation by stratifying across network layers. Without altering CLIP's weights, this framework unlocks capabilities such as text-to-image generation, style transfer, and image reconstruction.
arXiv Detail & Related papers (2025-05-29T06:55:26Z)
CURVE: CLIP-Utilized Reinforcement Learning for Visual Image Enhancement via Simple Image Processing [0.5803309695504829]
Low-Light Image Enhancement (LLIE) is crucial for improving both human perception and computer vision tasks. This paper addresses two challenges in zero-reference LLIE: obtaining perceptually 'good' images and maintaining computational efficiency for high-resolution images. We propose CLIP-Utilized Reinforcement learning-based Visual image Enhancement (CURVE).
arXiv Detail & Related papers (2025-05-29T05:09:13Z)
TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives [65.82577305915643]
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations. We show that generating "hard" negative captions via in-context learning and corresponding negative images with text-to-image generators offers a solution. We demonstrate that our method, named TripletCLIP, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark.
arXiv Detail & Related papers (2024-11-04T19:24:59Z)
Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets. We introduce a novel method named Decoder Pre-training with only text for STR (DPTR). DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
arXiv Detail & Related papers (2024-08-11T06:36:42Z)
CgT-GAN: CLIP-guided Text GAN for Image Captioning [48.276753091051035]
We propose CLIP-guided text GAN (CgT-GAN) to enable the model to "see" real visual modality. We use adversarial training to teach CgT-GAN to mimic the phrases of an external text corpus. CgT-GAN outperforms state-of-the-art methods significantly across all metrics.
arXiv Detail & Related papers (2023-08-23T10:25:37Z)
The CLIP Model is Secretly an Image-to-Prompt Converter [26.92989288717742]
The paper demonstrates that the CLIP model, as utilized in Stable Diffusion, inherently possesses the ability to instantaneously convert images into text prompts. Such an image-to-prompt conversion can be achieved by utilizing a linear projection matrix that is calculated in a closed form.
arXiv Detail & Related papers (2023-05-22T04:52:12Z)
Iterative Prompt Learning for Unsupervised Backlit Image Enhancement [86.90993077000789]
We propose a novel unsupervised backlit image enhancement method, abbreviated as CLIP-LIT. We show that the open-world CLIP prior aids in distinguishing between backlit and well-lit images. Our method alternates between updating the prompt learning framework and enhancement network until visually pleasing results are achieved.
arXiv Detail & Related papers (2023-03-30T17:37:14Z)
CLIP2GAN: Towards Bridging Text with the Latent Space of GANs [128.47600914674985]
We propose a novel framework, i.e., CLIP2GAN, by leveraging the CLIP model and StyleGAN. The key idea of our CLIP2GAN is to bridge the output feature embedding space of CLIP and the input latent space of StyleGAN.
arXiv Detail & Related papers (2022-11-28T04:07:17Z)
clip2latent: Text driven sampling of a pre-trained StyleGAN using denoising diffusion and CLIP [1.3733526575192976]
We introduce a new method to efficiently create text-to-image models from a pre-trained CLIP and StyleGAN. It enables text driven sampling with an existing generative model without any external data or fine-tuning. We leverage the alignment between CLIP's image and text embeddings to avoid the need for any text labelled data for training the conditional diffusion model.
arXiv Detail & Related papers (2022-10-05T15:49:41Z)
No Token Left Behind: Explainability-Aided Image Classification and Generation [79.4957965474334]
We present a novel explainability-based approach, which adds a loss term to ensure that CLIP focuses on all relevant semantic parts of the input. Our method yields an improvement in the recognition rate, without additional training or fine-tuning.
arXiv Detail & Related papers (2022-04-11T07:16:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.