Prompt Tuning for Zero-shot Compositional Learning
- URL: http://arxiv.org/abs/2312.02191v1
- Date: Sat, 2 Dec 2023 07:32:24 GMT
- Title: Prompt Tuning for Zero-shot Compositional Learning
- Authors: Lingyu Zhang, Ting Hua, Yilin Shen, Hongxia Jin
- Abstract summary: We propose a framework named Multi-Modal Prompt Tuning (MMPT) to inherit the "knowledgeable" property from the large pre-trained vision-language model.
On the UT-Zappos dataset, MMPT pushes the AUC score to 29.8, while the previous best score is 26.5.
On the more challenging MIT-States dataset, the AUC score of MMPT is 1.5 times better than the current state-of-the-art.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open World Compositional Zero-Shot Learning (OW-CZSL) is known to be an
extremely challenging task, which aims to recognize unseen compositions formed
from seen attributes and objects without any prior assumption of the output
space. To achieve this goal, a model has to be both "smart" and
"knowledgeable". To be smart, a model should be good at reasoning about the
interactions between attributes and objects in the seen compositions, while
being "knowledgeable" means the model has enough "common sense" about the open
world to "foresee" some features of the unseen compositions. Most previous work
focuses on the "smart" part, while few provide an effective solution for the
"knowledgeable" goal. In this paper, we propose a framework named
Multi-Modal Prompt Tuning (MMPT) to inherit the "knowledgeable" property from
the large pre-trained vision-language model. Extensive experiments show that
our proposed MMPT obtains new state-of-the-art results on the OW-CZSL task. On the
UT-Zappos dataset, MMPT pushes the AUC score to 29.8, while the previous best
score is 26.5. On the more challenging MIT-States dataset, the AUC score of
MMPT is 1.5 times better than the current state-of-the-art.
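No code accompanies this abstract, so the following is only a minimal sketch of the general multi-modal prompt-tuning recipe it describes: learnable prompt vectors are prepended to both the visual and textual input sequences of a frozen CLIP-style backbone, and only the prompts are trained. All module names, dimensions, and the pooling choice are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiModalPromptTuning(nn.Module):
    """Sketch: soft prompts for both modalities; the backbone stays frozen."""

    def __init__(self, vision_encoder, text_encoder, dim=512, n_prompts=8):
        super().__init__()
        self.vision_encoder = vision_encoder  # frozen transformer over patch embeddings
        self.text_encoder = text_encoder      # frozen transformer over token embeddings
        for p in list(vision_encoder.parameters()) + list(text_encoder.parameters()):
            p.requires_grad = False
        # The only trainable parameters: one set of soft prompts per modality.
        self.visual_prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
        self.text_prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)

    def forward(self, patch_embeds, token_embeds):
        # patch_embeds: (B, P, dim) image patches; token_embeds: (B, T, dim) text tokens
        B = patch_embeds.size(0)
        v = torch.cat([self.visual_prompts.expand(B, -1, -1), patch_embeds], dim=1)
        t = torch.cat([self.text_prompts.expand(B, -1, -1), token_embeds], dim=1)
        img = self.vision_encoder(v).mean(dim=1)  # mean-pooled image feature
        txt = self.text_encoder(t).mean(dim=1)    # mean-pooled composition feature
        # Cosine similarities between images and "attribute object" text prompts.
        return F.normalize(img, dim=-1) @ F.normalize(txt, dim=-1).T

# Toy usage with generic transformer encoders standing in for a VLM backbone.
enc = lambda: nn.TransformerEncoder(
    nn.TransformerEncoderLayer(512, 8, batch_first=True), num_layers=2)
model = MultiModalPromptTuning(enc(), enc())
scores = model(torch.randn(4, 49, 512), torch.randn(4, 6, 512))  # (4, 4) pair scores
```

Because the backbone is frozen, the prompts are the only trainable pathway, which is what lets the pre-trained "common sense" survive fine-tuning.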
Related papers
- Attention Based Simple Primitives for Open World Compositional Zero-Shot Learning
Compositional Zero-Shot Learning (CZSL) aims to predict unknown compositions made up of attribute and object pairs.
We are exploring Open World Compositional Zero-Shot Learning (OW-CZSL) in this study, where our test space encompasses all potential combinations of attributes and objects.
Our approach involves utilizing the self-attention mechanism between attributes and objects to achieve better generalization from seen to unseen compositions.
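A rough illustration of that mechanism (all names and sizes below are assumptions, not taken from the paper): a single standard self-attention layer conditions each primitive's embedding on the other before the pair is scored.

```python
import torch
import torch.nn as nn

# Illustrative sketch: joint self-attention over an (attribute, object) pair so
# that each primitive's representation is conditioned on the other.
dim = 256
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)

attr = torch.randn(1, 1, dim)  # e.g. embedding of the attribute "wet" (assumed)
obj = torch.randn(1, 1, dim)   # e.g. embedding of the object "dog" (assumed)
pair = torch.cat([attr, obj], dim=1)  # (1, 2, dim) two-token sequence
ctx, _ = attn(pair, pair, pair)       # attribute and object attend to each other
composition = ctx.mean(dim=1)         # (1, dim) fused feature for "wet dog"
```

Because the same layer is shared across all pairs, representations of unseen compositions are assembled from the interaction pattern learned on seen ones.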
arXiv Detail & Related papers (2024-07-18T17:11:29Z)
- Early Action Recognition with Action Prototypes
We propose a novel model that learns a prototypical representation of the full action for each class.
We decompose the video into short clips, where a visual encoder extracts features from each clip independently.
Later, a decoder aggregates together in an online fashion features from all the clips for the final class prediction.
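A minimal sketch of this pipeline, assuming a stand-in clip encoder, a GRU as the online decoder, and one learned prototype per class (illustrative choices, not the paper's architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, n_classes = 128, 10
clip_encoder = nn.Linear(512, dim)            # stand-in for a visual backbone
decoder = nn.GRU(dim, dim, batch_first=True)  # aggregates clip features online
prototypes = nn.Parameter(torch.randn(n_classes, dim))  # one per action class

clips = torch.randn(2, 5, 512)  # 2 videos, 5 short clips each (assumed shapes)
feats = clip_encoder(clips)     # (2, 5, dim), each clip encoded independently
agg, _ = decoder(feats)         # running summary available after every clip
# Early recognition: compare each partial summary to the full-action prototypes.
logits = F.normalize(agg, dim=-1) @ F.normalize(prototypes, dim=-1).T  # (2, 5, 10)
```

The key property is that `logits[:, t]` is a valid prediction after only t+1 clips, which is what makes the recognition "early".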
arXiv Detail & Related papers (2023-12-11T18:31:13Z)
- Heuristic Vision Pre-Training with Self-Supervised and Supervised Multi-Task Learning
We propose a novel pre-training framework by adopting both self-supervised and supervised visual pre-text tasks in a multi-task manner.
Results show that our pre-trained models can deliver results on par with or better than state-of-the-art (SOTA) results on multiple visual tasks.
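A hedged sketch of one such multi-task step, assuming a toy backbone, a supervised classification head, and rotation prediction as the self-supervised pre-text task (the paper's actual task mix may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU())
cls_head = nn.Linear(256, 100)  # supervised pre-text task (labels available)
rot_head = nn.Linear(256, 4)    # self-supervised task: predict the applied rotation

images = torch.randn(8, 3, 32, 32)
labels = torch.randint(0, 100, (8,))
rot_labels = torch.randint(0, 4, (8,))  # 0/90/180/270 degrees (assumed task)

feats = backbone(images)  # shared representation feeding both objectives
loss = F.cross_entropy(cls_head(feats), labels) \
     + 0.5 * F.cross_entropy(rot_head(feats), rot_labels)  # weight 0.5 assumed
loss.backward()  # both tasks shape the shared backbone
```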
arXiv Detail & Related papers (2023-10-11T14:06:04Z)
- Compositional Semantics for Open Vocabulary Spatio-semantic Representations
General-purpose mobile robots need to complete tasks without exact human instructions.
We propose latent semantic embeddings z* as a principled, learning-based knowledge representation for queryable semantic memories.
We demonstrate that a simple dense VLM trained on the COCO-Stuff dataset can learn z* for 181 overlapping semantics, reaching 42.23 mIoU.
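As an illustration of what "queryable" means here (shapes and the threshold are assumptions), a dense map of VLM-aligned pixel embeddings can be matched against any text embedding after the fact:

```python
import torch
import torch.nn.functional as F

H, W, dim = 64, 64, 512
pixel_embeds = F.normalize(torch.randn(H, W, dim), dim=-1)  # stored semantic memory
text_embed = F.normalize(torch.randn(dim), dim=-1)          # e.g. encoding of "sofa"

similarity = pixel_embeds @ text_embed  # (H, W) relevance heat map for the query
mask = similarity > 0.3                 # thresholded spatial answer (0.3 assumed)
```

Because matching happens at query time, the robot never needs its vocabulary fixed in advance.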
arXiv Detail & Related papers (2023-10-08T03:07:14Z)
- Hierarchical Visual Primitive Experts for Compositional Zero-Shot Learning
Compositional zero-shot learning (CZSL) aims to recognize compositions with prior knowledge of known primitives (attribute and object).
We propose a simple and scalable framework called Composition Transformer (CoT) to address these issues.
Our method achieves SoTA performance on several benchmarks, including MIT-States, C-GQA, and VAW-CZSL.
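A simplified sketch of the primitive-expert idea: separate attribute and object heads read different backbone levels, reflecting that attributes tend to be lower-level cues than object identity. The head sizes use MIT-States' 115 attributes and 245 objects as an example; none of this is the actual CoT implementation.

```python
import torch
import torch.nn as nn

low = torch.randn(2, 256)   # earlier-layer feature (texture- and color-like cues)
high = torch.randn(2, 512)  # later-layer feature (shape- and identity-like cues)

attr_expert = nn.Linear(256, 115)  # one logit per attribute
obj_expert = nn.Linear(512, 245)   # one logit per object

attr_logits = attr_expert(low)
obj_logits = obj_expert(high)
# Score a composition as the sum of its primitives' scores.
pair_scores = attr_logits.unsqueeze(2) + obj_logits.unsqueeze(1)  # (2, 115, 245)
```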
arXiv Detail & Related papers (2023-08-08T03:24:21Z)
- MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments
Self-supervised learning can be used to mitigate the data-greedy training needs of Vision Transformer networks.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
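A rough sketch of the masked codebook-assignment objective (every component below is a simplification assumed for illustration, not the MOCA architecture): a teacher quantizes patch features against a codebook, and the student must recover those discrete codes for the patches it cannot see.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

K, dim, P = 1024, 256, 49  # codebook size, feature dim, patches per image (assumed)
codebook = F.normalize(torch.randn(K, dim), dim=-1)
teacher_feats = F.normalize(torch.randn(2, P, dim), dim=-1)
targets = (teacher_feats @ codebook.T).argmax(dim=-1)  # (2, P) discrete code ids

student = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, K))
mask = torch.rand(2, P) < 0.6     # hide 60% of the patches (ratio assumed)
masked_feats = teacher_feats.clone()
masked_feats[mask] = 0.0          # crude stand-in for masking the input

logits = student(masked_feats)                       # (2, P, K) code predictions
loss = F.cross_entropy(logits[mask], targets[mask])  # scored on masked positions only
```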
arXiv Detail & Related papers (2023-07-18T15:46:20Z)
- Recognizing Unseen Objects via Multimodal Intensive Knowledge Graph Propagation
We propose a multimodal intensive ZSL framework that matches regions of images with corresponding semantic embeddings.
We conduct extensive experiments and evaluate our model on large-scale real-world data.
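A minimal sketch of region-to-semantic matching under assumed shapes: each image region is scored against class semantic embeddings, so an unseen class is reachable through its semantic vector alone.

```python
import torch
import torch.nn.functional as F

regions = F.normalize(torch.randn(2, 36, 300), dim=-1)  # 36 region features per image
class_sem = F.normalize(torch.randn(50, 300), dim=-1)   # 50 class semantic embeddings

scores = regions @ class_sem.T           # (2, 36, 50) region-to-class affinities
image_logits = scores.max(dim=1).values  # an image matches a class if any region does
```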
arXiv Detail & Related papers (2023-06-14T13:07:48Z)
- USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval
Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality.
Existing approaches typically suffer from two major limitations.
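To make the "momentum contrast" ingredient concrete, here is a hedged MoCo-style sketch (not the USER architecture itself): a momentum encoder supplies slowly-moving keys, and a queue of past keys enlarges the negative pool for the InfoNCE loss.

```python
import torch
import torch.nn.functional as F

def momentum_update(encoder_q, encoder_k, m=0.999):
    # The key encoder tracks the query encoder as an exponential moving average.
    for pq, pk in zip(encoder_q.parameters(), encoder_k.parameters()):
        pk.data.mul_(m).add_(pq.data, alpha=1.0 - m)

def info_nce(img_feats, txt_feats, queue, temperature=0.07):
    # Positives are aligned (image, text) pairs; negatives come from the queue.
    pos = (img_feats * txt_feats).sum(dim=-1, keepdim=True)  # (B, 1)
    neg = img_feats @ queue.T                                # (B, Q)
    logits = torch.cat([pos, neg], dim=1) / temperature
    return F.cross_entropy(logits, torch.zeros(len(logits), dtype=torch.long))

img = F.normalize(torch.randn(4, 256), dim=-1)       # query-encoder image features
txt = F.normalize(torch.randn(4, 256), dim=-1)       # matching text features
queue = F.normalize(torch.randn(4096, 256), dim=-1)  # stored keys from past batches
loss = info_nce(img, txt, queue)
```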
arXiv Detail & Related papers (2023-01-17T12:42:58Z)
- Unsupervised Object-Centric Learning with Bi-Level Optimized Query Slot Attention
The Slot-Attention module has played an important role thanks to its simple yet effective design, and has fostered many powerful variants.
We propose to address these issues by (1) initializing Slot-Attention modules with learnable queries and (2) optimizing the model with bi-level optimization.
Our model achieves state-of-the-art results on both synthetic and complex real-world datasets in unsupervised image segmentation and reconstruction.
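A simplified single Slot-Attention iteration with the learnable-query initialization the summary mentions (the bi-level optimization is omitted, and all sizes are assumptions):

```python
import torch
import torch.nn as nn

dim, n_slots = 64, 5
queries = nn.Parameter(torch.randn(n_slots, dim) * 0.02)  # learnable slot init
to_q, to_k, to_v = (nn.Linear(dim, dim, bias=False) for _ in range(3))
update = nn.GRUCell(dim, dim)

inputs = torch.randn(2, 196, dim)  # flattened feature map (batch of 2)
slots = queries.expand(2, -1, -1)  # identical learned starting point per image

attn = torch.einsum('bnd,bsd->bns', to_k(inputs), to_q(slots)) / dim ** 0.5
attn = attn.softmax(dim=-1)                  # slots compete for every location
attn = attn / attn.sum(dim=1, keepdim=True)  # weighted mean over locations
updates = torch.einsum('bns,bnd->bsd', attn, to_v(inputs))
slots = update(updates.reshape(-1, dim), slots.reshape(-1, dim)).view(2, n_slots, dim)
```

Starting every image from the same trained queries, instead of fresh random samples, is what removes the run-to-run instability of the original module.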
arXiv Detail & Related papers (2022-10-17T12:14:59Z)