Exploring CLIP for Assessing the Look and Feel of Images
- URL: http://arxiv.org/abs/2207.12396v1
- Date: Mon, 25 Jul 2022 17:58:16 GMT
- Title: Exploring CLIP for Assessing the Look and Feel of Images
- Authors: Jianyi Wang, Kelvin C.K. Chan, Chen Change Loy
- Abstract summary: We introduce Contrastive Language-Image Pre-training (CLIP) models for assessing both the quality perception (look) and abstract perception (feel) of images in a zero-shot manner.
Our results show that CLIP captures meaningful priors that generalize well to different perceptual assessments.
- Score: 87.97623543523858
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Measuring the perception of visual content is a long-standing problem in
computer vision. Many mathematical models have been developed to evaluate the
look or quality of an image. Despite the effectiveness of such tools in
quantifying degradations such as noise and blurriness levels, such
quantification is loosely coupled with human language. When it comes to more
abstract perception about the feel of visual content, existing methods can only
rely on supervised models that are explicitly trained with labeled data
collected via laborious user study. In this paper, we go beyond the
conventional paradigms by exploring the rich visual language prior encapsulated
in Contrastive Language-Image Pre-training (CLIP) models for assessing both the
quality perception (look) and abstract perception (feel) of images in a
zero-shot manner. In particular, we discuss effective prompt designs and show
an effective prompt pairing strategy to harness the prior. We also provide
extensive experiments on controlled datasets and Image Quality Assessment (IQA)
benchmarks. Our results show that CLIP captures meaningful priors that
generalize well to different perceptual assessments. Code will be available at
https://github.com/IceClear/CLIP-IQA.
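As a rough illustration of the prompt-pairing idea mentioned in the abstract, the sketch below scores an image zero-shot with an antonym prompt pair using the Hugging Face CLIP API. The checkpoint name, the prompts, and the image path are illustrative assumptions and do not reproduce the authors' exact CLIP-IQA implementation (see the linked repository for that).

```python
# Minimal sketch: zero-shot quality scoring with an antonym prompt pair.
# The checkpoint, prompts, and image path are illustrative assumptions,
# not the authors' exact CLIP-IQA setup.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")          # any test image
prompts = ["Good photo.", "Bad photo."]    # antonym prompt pair

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the scaled image-text cosine similarities;
# a softmax over the pair turns them into a relative score in [0, 1].
probs = outputs.logits_per_image.softmax(dim=-1)
quality_score = probs[0, 0].item()
print(f"quality score: {quality_score:.3f}")
```

Swapping in other antonym pairs (for example, prompts describing brightness or mood) extends the same scheme from the quality "look" to the more abstract "feel" attributes discussed in the abstract.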
Related papers
- Vision-Language Consistency Guided Multi-modal Prompt Learning for Blind AI Generated Image Quality Assessment [57.07360640784803]
We propose vision-language consistency guided multi-modal prompt learning for blind AI-generated image quality assessment (AGIQA).
Specifically, we introduce learnable textual and visual prompts in language and vision branches of Contrastive Language-Image Pre-training (CLIP) models.
We design a text-to-image alignment quality prediction task, whose learned vision-language consistency knowledge is used to guide the optimization of the above multi-modal prompts.
arXiv Detail & Related papers (2024-06-24T13:45:31Z)
- Assessing Image Quality Using a Simple Generative Representation [34.173947968362675]
VAE-QA is a simple and efficient method for predicting image quality when a full-reference image is available.
We evaluate our approach on four standard benchmarks and find that it significantly improves generalization across datasets.
arXiv Detail & Related papers (2024-04-28T13:18:47Z)
- Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pre-training model as their backbone to gain semantic awareness, but CLIP's generic pre-training is not aligned with quality-specific semantics, creating a mismatch.
Recent approaches have attempted to address this mismatch using prompt techniques, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z)
- Heuristic Vision Pre-Training with Self-Supervised and Supervised Multi-Task Learning [0.0]
We propose a novel pre-training framework that adopts both self-supervised and supervised visual pretext tasks in a multi-task manner.
Results show that our pre-trained models perform on par with or better than state-of-the-art (SOTA) methods on multiple visual tasks.
arXiv Detail & Related papers (2023-10-11T14:06:04Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- K-LITE: Learning Transferable Visual Models with External Knowledge [242.3887854728843]
K-LITE (Knowledge-augmented Language-Image Training and Evaluation) is a strategy to leverage external knowledge to build transferable visual systems.
In training, it enriches entities in natural language with WordNet and Wiktionary knowledge.
In evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts; a minimal sketch of this style of enrichment appears after this list.
arXiv Detail & Related papers (2022-04-20T04:47:01Z)
- Unpaired Image Captioning by Image-level Weakly-Supervised Visual Concept Recognition [83.93422034664184]
Unpaired image captioning (UIC) aims to describe images without using image-caption pairs during training.
Most existing studies use off-the-shelf algorithms to obtain the visual concepts.
We propose a novel approach to achieve cost-effective UIC using image-level labels.
arXiv Detail & Related papers (2022-03-07T08:02:23Z)
- Detection and Captioning with Unseen Object Classes [12.894104422808242]
Test images may contain visual objects with no corresponding visual or textual training examples.
We propose a detection-driven approach based on a generalized zero-shot detection model and a template-based sentence generation model.
Our experiments show that the proposed zero-shot detection model obtains state-of-the-art performance on the MS-COCO dataset.
arXiv Detail & Related papers (2021-08-13T10:43:20Z)
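For the K-LITE entry above (referenced there), the following is a minimal sketch of what WordNet-based enrichment of a class name could look like before the text is handed to a language-image model. The prompt template and category names are illustrative assumptions; the actual method also draws on Wiktionary and folds the enriched text into both training and evaluation.

```python
# Minimal sketch: enrich a class name with its WordNet gloss, in the spirit
# of the K-LITE entry above. Categories and the prompt template are
# illustrative assumptions, not the paper's exact pipeline.
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)  # one-time corpus download

def enrich(category: str) -> str:
    """Append the first WordNet definition of the category, if one exists."""
    synsets = wn.synsets(category.replace(" ", "_"))
    if not synsets:
        return f"a photo of a {category}."
    gloss = synsets[0].definition()
    return f"a photo of a {category}, which is {gloss}."

for name in ["goldfish", "mountain bike", "espresso"]:
    print(enrich(name))
```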
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the information above and is not responsible for any consequences of its use.