Learning to Compose Diversified Prompts for Image Emotion Classification
- URL: http://arxiv.org/abs/2201.10963v1
- Date: Wed, 26 Jan 2022 14:31:55 GMT
- Title: Learning to Compose Diversified Prompts for Image Emotion Classification
- Authors: Sinuo Deng, Lifang Wu, Ge Shi, Lehao Xing, Meng Jian
- Abstract summary: Contrastive Language-Image Pre-training (CLIP) represents the latest incarnation of pre-trained vision-language models.
CLIP has recently demonstrated strong performance on a wide range of downstream vision-language tasks such as Visual Question Answering.
We propose a general framework that shows how CLIP can be effectively applied to Image Emotion Classification.
- Score: 5.586293129420233
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive Language-Image Pre-training (CLIP) represents the latest
incarnation of pre-trained vision-language models. Although CLIP has recently
demonstrated strong performance on a wide range of downstream vision-language tasks
such as Visual Question Answering, it is still underexplored for Image Emotion
Classification (IEC). Adapting CLIP to the IEC task poses three significant
challenges: a tremendous gap between the pretraining and IEC training objectives,
suboptimal prompts, and prompts that are shared and invariant across all instances. In this paper, we
propose a general framework that shows how CLIP can be effectively applied to
IEC. We first introduce a prompt tuning method that mimics the pretraining
objective of CLIP and thus can leverage the rich image and text semantics
entailed in CLIP. Then we automatically compose instance-specific prompts by
conditioning them on the categories and image contents of instances,
diversifying the prompts and avoiding the suboptimality problem. Evaluations on six
widely-used affective datasets demonstrate that our proposed method outperforms
the state-of-the-art methods by a large margin (e.g., up to a 9.29% accuracy gain
on the EmotionROI dataset) on IEC tasks, with only a few parameters trained. Our
codes will be publicly available for research purposes.
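To make the mechanism in the abstract concrete, here is a minimal sketch of CLIP-style classification with learnable, instance-conditioned prompts, written in plain PyTorch. It illustrates the general idea only and is not the authors' released code: the module and parameter names (`PromptComposer`, `ctx_len`, `meta_net`) are placeholders introduced here, and the frozen CLIP image/text encoders are assumed to be provided externally.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PromptComposer(nn.Module):
    """Sketch of instance-conditioned prompt composition (illustrative only).

    Assumptions: frozen CLIP encoders live outside this module; class-name
    token embeddings and image features are precomputed and passed in.
    """

    def __init__(self, num_classes: int, ctx_len: int = 4,
                 ctx_dim: int = 512, embed_dim: int = 512):
        super().__init__()
        # Small bank of learnable context tokens shared across classes.
        self.context = nn.Parameter(torch.randn(ctx_len, ctx_dim) * 0.02)
        # Maps the image feature to a per-instance shift of the context,
        # which is what makes the prompt instance-specific.
        self.meta_net = nn.Sequential(
            nn.Linear(embed_dim, embed_dim // 4),
            nn.ReLU(inplace=True),
            nn.Linear(embed_dim // 4, ctx_dim),
        )
        self.num_classes = num_classes

    def forward(self, image_feat, class_token_embeds):
        # image_feat:         (B, embed_dim)   frozen CLIP image features
        # class_token_embeds: (C, T, ctx_dim)  embedded class-name tokens
        shift = self.meta_net(image_feat)                     # (B, ctx_dim)
        ctx = self.context.unsqueeze(0) + shift.unsqueeze(1)  # (B, ctx_len, ctx_dim)
        B, C = image_feat.size(0), self.num_classes
        ctx = ctx.unsqueeze(1).expand(B, C, -1, -1)           # (B, C, ctx_len, ctx_dim)
        cls = class_token_embeds.unsqueeze(0).expand(B, -1, -1, -1)
        # Prompt = [image-conditioned learnable context] + [class-name tokens]
        return torch.cat([ctx, cls], dim=2)                   # (B, C, ctx_len+T, ctx_dim)


def clip_style_logits(image_feat, prompt_feat, temperature: float = 0.01):
    """CLIP-style classification: cosine similarity between the image feature
    and each class's encoded prompt, scaled by a temperature."""
    image_feat = F.normalize(image_feat, dim=-1)    # (B, d)
    prompt_feat = F.normalize(prompt_feat, dim=-1)  # (B, C, d)
    return torch.einsum("bd,bcd->bc", image_feat, prompt_feat) / temperature
```

In such a setup, only the prompt-composition parameters would be trained: the composed prompts are passed through CLIP's frozen text encoder to obtain `prompt_feat`, and a cross-entropy loss over the emotion categories is applied to the similarity logits, mirroring CLIP's image-text matching objective. The paper's actual architecture and training details may differ.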
Related papers
- Meta-Adapter: An Online Few-shot Learner for Vision-Language Model [64.21017759533474]
Contrastive vision-language pre-training, known as CLIP, demonstrates remarkable potential in perceiving open-world visual concepts.
Few-shot learning methods based on CLIP typically require offline fine-tuning of the parameters on few-shot samples.
We propose the Meta-Adapter, a lightweight residual-style adapter, to refine the CLIP features guided by the few-shot samples in an online manner.
arXiv Detail & Related papers (2023-11-07T07:27:16Z)
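The "lightweight residual-style adapter" mentioned in the Meta-Adapter entry above can be pictured as a small bottleneck module whose output is added back onto the frozen CLIP feature. The sketch below shows only this generic residual-adapter pattern under assumed dimensions and blend weight; it is not the Meta-Adapter architecture itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualAdapter(nn.Module):
    """Generic residual-style adapter over frozen CLIP features (illustrative).

    A small bottleneck MLP whose output is blended back into the original
    feature, so the frozen representation is refined rather than replaced.
    """

    def __init__(self, dim: int = 512, bottleneck: int = 64, alpha: float = 0.2):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.alpha = alpha  # residual blend ratio (hypothetical value)

    def forward(self, clip_feat: torch.Tensor) -> torch.Tensor:
        refined = self.up(F.relu(self.down(clip_feat)))
        out = clip_feat + self.alpha * refined  # residual refinement
        return F.normalize(out, dim=-1)
```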
- MVP-SEG: Multi-View Prompt Learning for Open-Vocabulary Semantic Segmentation [44.28355088831045]
We first demonstrate the necessity of image-pixel CLIP feature adaptation, and then provide Multi-View Prompt learning (MVP-SEG).
MVP-SEG is an effective solution to achieve image-pixel adaptation and to solve open-vocabulary semantic segmentation.
Experiments show that the multi-view prompts learned from seen categories have strong generalization to unseen categories.
arXiv Detail & Related papers (2023-04-14T07:01:47Z)
- Task-Oriented Multi-Modal Mutual Learning for Vision-Language Models [52.3032592038514]
We propose a class-aware text prompt to enrich generated prompts with label-related image information.
We achieve an average improvement of 4.03% on new classes and 3.19% on harmonic-mean over eleven classification benchmarks.
arXiv Detail & Related papers (2023-03-30T06:02:40Z)
- CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation [19.208559353954833]
This paper explores the potential of Contrastive Language-Image Pre-training models (CLIP) to localize different categories with only image-level labels.
To efficiently generate high-quality segmentation masks from CLIP, we propose a novel WSSS framework called CLIP-ES.
arXiv Detail & Related papers (2022-12-16T06:23:59Z)
- CPL: Counterfactual Prompt Learning for Vision and Language Models [76.18024920393245]
This paper presents a novel Counterfactual Prompt Learning (CPL) method for vision and language models.
CPL simultaneously employs counterfactual generation and contrastive learning in a joint optimization framework.
Experiments demonstrate that CPL can obtain superior few-shot performance on different vision and language tasks.
arXiv Detail & Related papers (2022-10-19T08:06:39Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning [88.5382122413913]
We study whether language supervision can result in vision models with more transferable representations than traditional image-only methods.
We find that image-only methods do not match CLIP's transfer performance, even when they are trained with more image data.
Motivated by our findings, we devise simple prescriptions to enable CLIP to better leverage the language information present in existing pre-training datasets.
arXiv Detail & Related papers (2022-07-15T17:50:51Z)
- A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model [61.58071099082296]
It is unclear how to make zero-shot recognition work well on broader vision problems, such as object detection and semantic segmentation.
In this paper, we target zero-shot semantic segmentation by building on an off-the-shelf pre-trained vision-language model, i.e., CLIP.
Our experimental results show that this simple framework surpasses the previous state of the art by a large margin.
arXiv Detail & Related papers (2021-12-29T18:56:18Z)