CURVE: CLIP-Utilized Reinforcement Learning for Visual Image Enhancement via Simple Image Processing
- URL: http://arxiv.org/abs/2505.23102v2
- Date: Tue, 08 Jul 2025 14:27:04 GMT
- Title: CURVE: CLIP-Utilized Reinforcement Learning for Visual Image Enhancement via Simple Image Processing
- Authors: Yuka Ogino, Takahiro Toizumi, Atsushi Ito
- Abstract summary: Low-Light Image Enhancement (LLIE) is crucial for improving both human perception and computer vision tasks. This paper addresses two challenges in zero-reference LLIE: obtaining perceptually 'good' images and maintaining computational efficiency for high-resolution images. We propose CLIP-Utilized Reinforcement learning-based Visual image Enhancement (CURVE).
- Score: 0.5803309695504829
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Low-Light Image Enhancement (LLIE) is crucial for improving both human perception and computer vision tasks. This paper addresses two challenges in zero-reference LLIE: obtaining perceptually 'good' images using the Contrastive Language-Image Pre-Training (CLIP) model and maintaining computational efficiency for high-resolution images. We propose CLIP-Utilized Reinforcement learning-based Visual image Enhancement (CURVE). CURVE employs a simple image processing module that adjusts global image tone based on a Bézier curve and estimates its processing parameters iteratively. The estimator is trained by reinforcement learning with rewards designed using CLIP text embeddings. Experiments on low-light and multi-exposure datasets demonstrate the performance of CURVE in terms of enhancement quality and processing speed compared to conventional methods.
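For intuition, here is a minimal sketch of the kind of global tone adjustment described above: a cubic Bézier tone curve with endpoints fixed at (0, 0) and (1, 1), applied per pixel and repeated for a few iterations. The specific parameterization (control-point x-coordinates fixed at 1/3 and 2/3, two free heights `p1` and `p2`) and the `enhance` loop are illustrative assumptions, not the paper's exact formulation; the CLIP text-embedding reward used to train the parameter estimator is sketched after the related-papers list below.

```python
import numpy as np


def bezier_tone_curve(t, p1, p2):
    """Cubic Bezier tone curve with endpoints fixed at (0, 0) and (1, 1).

    With the control points' x-coordinates fixed at 1/3 and 2/3, the
    parametric curve reduces to an explicit map of normalised intensity t:
        y(t) = 3 (1 - t)^2 t p1 + 3 (1 - t) t^2 p2 + t^3
    The identity mapping corresponds to p1 = 1/3, p2 = 2/3.
    """
    return 3 * (1 - t) ** 2 * t * p1 + 3 * (1 - t) * t ** 2 * p2 + t ** 3


def enhance(image, steps):
    """Iteratively apply global tone adjustments to an image in [0, 1].

    `steps` is a sequence of (p1, p2) pairs, e.g. produced one at a time
    by a lightweight parameter estimator that inspects the current result.
    """
    out = image.astype(np.float32)
    for p1, p2 in steps:
        out = np.clip(bezier_tone_curve(out, p1, p2), 0.0, 1.0)
    return out


# Toy usage: brighten a synthetic dark image in three small steps.
dark = np.random.rand(240, 320, 3) * 0.2
bright = enhance(dark, steps=[(0.45, 0.75), (0.40, 0.70), (0.36, 0.67)])
```

Because the curve is just a per-pixel mapping, its parameters can be estimated on a heavily downscaled copy of the input and then applied to the full-resolution image, which is presumably what keeps this style of enhancement fast on high-resolution inputs.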
Related papers
- ELIP: Enhanced Visual-Language Foundation Models for Image Retrieval [83.01358520910533]
We introduce a new framework that can boost the performance of large-scale pre-trained vision-language models. The approach, Enhanced Language-Image Pre-training (ELIP), uses the text query, via a simple mapping network, to predict a set of visual prompts. ELIP can easily be applied to the commonly used CLIP, SigLIP and BLIP-2 networks.
arXiv Detail & Related papers (2025-02-21T18:59:57Z)
- Leveraging Content and Context Cues for Low-Light Image Enhancement [25.97198463881292]
Low-light conditions have an adverse impact on machine cognition, limiting the performance of computer vision systems in real life. We propose to improve existing zero-reference low-light enhancement by leveraging the CLIP model to capture an image prior and provide semantic guidance. We experimentally show that the proposed prior and semantic guidance help improve overall image contrast and hue, as well as background-foreground discrimination.
arXiv Detail & Related papers (2024-12-10T17:32:09Z)
- Ranking-aware adapter for text-driven image ordering with CLIP [76.80965830448781]
We propose an effective yet efficient approach that reframes the CLIP model into a learning-to-rank task. Our approach incorporates learnable prompts to adapt to new instructions for ranking purposes. Our ranking-aware adapter consistently outperforms fine-tuned CLIPs on various tasks.
arXiv Detail & Related papers (2024-12-09T18:51:05Z)
- FALIP: Visual Prompt as Foveal Attention Boosts CLIP Zero-Shot Performance [7.041364616661048]
Foveal-Attention CLIP (FALIP) adjusts CLIP's attention by inserting foveal attention masks into the multi-head self-attention module.
FALIP effectively boosts CLIP zero-shot performance in tasks such as referring expressions comprehension, image classification, and 3D point cloud recognition.
arXiv Detail & Related papers (2024-07-08T03:23:13Z)
- CLIP Guided Image-perceptive Prompt Learning for Image Enhancement [15.40368082025006]
We propose Contrastive Language-Image Pre-Training (CLIP) Guided Prompt Learning.
We learn image-perceptive prompts that distinguish between original and target images using the CLIP model.
We introduce a very simple enhancement network, built on a simple baseline, that predicts the weights of three different LUTs.
arXiv Detail & Related papers (2023-11-07T12:36:20Z)
- Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features [32.138956674478116]
Given a query composed of a reference image and a relative caption, the goal of Composed Image Retrieval is to retrieve images visually similar to the reference one.
We use features from the OpenAI CLIP model to tackle the considered task.
We train a Combiner network that learns to combine the image-text features integrating the bimodal information.
arXiv Detail & Related papers (2023-08-22T15:03:16Z)
- Iterative Prompt Learning for Unsupervised Backlit Image Enhancement [86.90993077000789]
We propose a novel unsupervised backlit image enhancement method, abbreviated as CLIP-LIT.
We show that the open-world CLIP prior aids in distinguishing between backlit and well-lit images.
Our method alternates between updating the prompt learning framework and enhancement network until visually pleasing results are achieved.
arXiv Detail & Related papers (2023-03-30T17:37:14Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- Real-World Image Super-Resolution by Exclusionary Dual-Learning [98.36096041099906]
Real-world image super-resolution is a practical image restoration problem that aims to obtain high-quality images from in-the-wild input.
Deep learning-based methods have achieved promising restoration quality on real-world image super-resolution datasets.
We propose Real-World image Super-Resolution by Exclusionary Dual-Learning (RWSR-EDL) to address the feature diversity in perceptual- and L1-based cooperative learning.
arXiv Detail & Related papers (2022-06-06T13:28:15Z)
- VL-LTR: Learning Class-wise Visual-Linguistic Representation for Long-Tailed Visual Recognition [61.75391989107558]
We present a visual-linguistic long-tailed recognition framework, termed VL-LTR.
Our method can learn visual representation from images and corresponding linguistic representation from noisy class-level text descriptions.
Notably, our method achieves 77.2% overall accuracy on ImageNet-LT, which significantly outperforms the previous best method by over 17 points.
arXiv Detail & Related papers (2021-11-26T16:24:03Z)
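Both CURVE's reward and several of the entries above (e.g. the backlit/well-lit distinction in CLIP-LIT, or the image-perceptive prompts for enhancement) rest on the same basic mechanism: scoring an image by its CLIP similarity to text describing desirable and undesirable appearance. The sketch below shows a generic two-prompt version using the OpenAI `clip` package; the prompt texts and the softmax score are illustrative assumptions, not the exact prompts or reward formulations used in those papers.

```python
import clip   # pip install git+https://github.com/openai/CLIP.git
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical prompt pair; the papers above design or learn their own prompts.
prompts = ["a well-lit, clear photo", "a dark, underexposed photo"]
with torch.no_grad():
    text_emb = model.encode_text(clip.tokenize(prompts).to(device))
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)


def clip_quality_score(pil_image: Image.Image) -> float:
    """Probability (under CLIP) that the image matches the 'good' prompt.

    Such a score can act as a zero-reference quality signal, e.g. as the
    reward for a reinforcement-learned tone-curve parameter estimator.
    """
    with torch.no_grad():
        img = preprocess(pil_image).unsqueeze(0).to(device)
        img_emb = model.encode_image(img)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        logits = 100.0 * img_emb @ text_emb.T  # approx. CLIP's learned logit scale
        probs = logits.softmax(dim=-1)
    return probs[0, 0].item()
```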
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.