Decomposed Soft Prompt Guided Fusion Enhancing for Compositional
Zero-Shot Learning
- URL: http://arxiv.org/abs/2211.10681v1
- Date: Sat, 19 Nov 2022 12:29:12 GMT
- Title: Decomposed Soft Prompt Guided Fusion Enhancing for Compositional
Zero-Shot Learning
- Authors: Xiaocheng Lu, Ziming Liu, Song Guo, Jingcai Guo
- Abstract summary: We propose a novel framework termed Decomposed Fusion with Soft Prompt (DFSP), by involving vision-language models (VLMs) for unseen composition recognition.
Specifically, DFSP constructs a vector combination of learnable soft prompts with state and object to establish the joint representation of them.
In addition, a cross-modal fusion module is designed between the language and image branches, which decomposes state and object among language features instead of image features.
- Score: 15.406125901927004
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Compositional Zero-Shot Learning (CZSL) aims to recognize novel concepts
formed by known states and objects during training. Existing methods either
learn the combined state-object representation, challenging the generalization
of unseen compositions, or design two classifiers to identify state and object
separately from image features, ignoring the intrinsic relationship between
them. To jointly eliminate the above issues and construct a more robust CZSL
system, we propose a novel framework termed Decomposed Fusion with Soft Prompt
(DFSP), by involving vision-language models (VLMs) for unseen composition
recognition. Specifically, DFSP constructs a vector combination of learnable
soft prompts with state and object to establish the joint representation of
them. In addition, a cross-modal decomposed fusion module is designed between
the language and image branches, which decomposes state and object among
language features instead of image features. Notably, being fused with the
decomposed features, the image features can be more expressive for learning the
relationship with states and objects, respectively, to improve the response of
unseen compositions in the pair space, hence narrowing the domain gap between
seen and unseen sets. Experimental results on three challenging benchmarks
demonstrate that our approach significantly outperforms other state-of-the-art
methods by large margins.
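The idea of decomposing state and object on the language side can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the decomposition is computed, so the averaging rule below, the function names (`decompose`, `rank_pairs`), and the toy 4-dimensional vectors are all assumptions made for illustration. In a real system the composition features would come from a VLM text encoder fed with learnable soft prompts, and the image feature from the VLM image encoder.

```python
import math
from collections import defaultdict


def mean_vectors(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def decompose(pair_features):
    """Split composition-level text features into state and object features.

    ASSUMPTION: a state (or object) feature is the mean of the text features
    of all compositions that contain that state (or object).
    """
    by_state, by_object = defaultdict(list), defaultdict(list)
    for (state, obj), vec in pair_features.items():
        by_state[state].append(vec)
        by_object[obj].append(vec)
    state_feats = {s: mean_vectors(v) for s, v in by_state.items()}
    object_feats = {o: mean_vectors(v) for o, v in by_object.items()}
    return state_feats, object_feats


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_pairs(image_feature, pair_features):
    """Order candidate (state, object) pairs by similarity to the image."""
    return sorted(
        pair_features,
        key=lambda pair: cosine(image_feature, pair_features[pair]),
        reverse=True,
    )


# Hypothetical text features standing in for text-encoder outputs.
pair_features = {
    ("wet", "dog"): [1.0, 0.0, 1.0, 0.0],
    ("wet", "cat"): [1.0, 0.0, 0.0, 1.0],
    ("dry", "dog"): [0.0, 1.0, 1.0, 0.0],
}
state_feats, object_feats = decompose(pair_features)

# Hypothetical image feature standing in for an image-encoder output.
image_feature = [0.9, 0.1, 0.8, 0.0]
ranking = rank_pairs(image_feature, pair_features)
```

In this toy setup the state and object features are derived purely from language features, which mirrors the abstract's point that the decomposition happens on the language side rather than on image features. The cross-modal fusion step that the abstract describes is not modeled here.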
Related papers
- Synchronizing Vision and Language: Bidirectional Token-Masking
AutoEncoder for Referring Image Segmentation [26.262887028563163]
Referring Image Segmentation (RIS) aims to segment target objects expressed in natural language within a scene at the pixel level.
We propose a novel bidirectional token-masking autoencoder (BTMAE) inspired by the masked autoencoder (MAE).
BTMAE learns the context of image-to-language and language-to-image by reconstructing missing features in both image and language features at the token level.
arXiv Detail & Related papers (2023-11-29T07:33:38Z)
- Edge Guided GANs with Multi-Scale Contrastive Learning for Semantic
Image Synthesis [139.2216271759332]
We propose a novel ECGAN for the challenging semantic image synthesis task.
The semantic labels do not provide detailed structural information, making it challenging to synthesize local details and structures.
The widely adopted CNN operations such as convolution, down-sampling, and normalization usually cause spatial resolution loss.
We propose a novel contrastive learning method, which aims to enforce pixel embeddings belonging to the same semantic class to generate more similar image content.
arXiv Detail & Related papers (2023-07-22T14:17:19Z)
- Troika: Multi-Path Cross-Modal Traction for Compositional Zero-Shot Learning [37.445883075993414]
Recent compositional zero-shot learning (CZSL) methods adapt pre-trained vision-language models (VLMs) by constructing trainable prompts only for composed state-object pairs.
We propose a novel paradigm for CZSL models that establishes three identification branches (i.e., Multi-Path) to jointly model the state, object, and composition.
We conduct extensive experiments on three popular benchmarks, where our method significantly outperforms existing methods in both closed-world and open-world settings.
arXiv Detail & Related papers (2023-03-27T14:10:26Z)
- Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features.
Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment.
arXiv Detail & Related papers (2022-12-27T09:13:19Z)
- Mutual Balancing in State-Object Components for Compositional Zero-Shot
Learning [0.0]
Compositional Zero-Shot Learning (CZSL) aims to recognize unseen compositions from seen states and objects.
We propose a novel method called MUtual balancing in STate-object components (MUST) for CZSL, which provides a balancing inductive bias for the model.
Our approach significantly outperforms the state-of-the-art on MIT-States, UT-Zappos, and C-GQA when combined with the basic CZSL frameworks.
arXiv Detail & Related papers (2022-11-19T10:21:22Z)
- ProCC: Progressive Cross-primitive Compatibility for Open-World
Compositional Zero-Shot Learning [29.591615811894265]
Open-World Compositional Zero-shot Learning (OW-CZSL) aims to recognize novel compositions of state and object primitives in images with no priors on the compositional space.
We propose a novel method, termed Progressive Cross-primitive Compatibility (ProCC), to mimic the human learning process for OW-CZSL tasks.
arXiv Detail & Related papers (2022-11-19T10:09:46Z)
- Siamese Contrastive Embedding Network for Compositional Zero-Shot
Learning [76.13542095170911]
Compositional Zero-Shot Learning (CZSL) aims to recognize unseen compositions formed from seen state and object during training.
We propose a novel Siamese Contrastive Embedding Network (SCEN) for unseen composition recognition.
Our method significantly outperforms the state-of-the-art approaches on three challenging benchmark datasets.
arXiv Detail & Related papers (2022-06-29T09:02:35Z)
- Learning Contrastive Representation for Semantic Correspondence [150.29135856909477]
We propose a multi-level contrastive learning approach for semantic matching.
We show that image-level contrastive learning is a key component to encourage the convolutional features to find correspondence between similar objects.
arXiv Detail & Related papers (2021-09-22T18:34:14Z)
- Semantic Disentangling Generalized Zero-Shot Learning [50.259058462272435]
Generalized Zero-Shot Learning (GZSL) aims to recognize images from both seen and unseen categories.
In this paper, we propose a novel feature disentangling approach based on an encoder-decoder architecture.
The proposed model aims to distill quality semantic-consistent representations that capture intrinsic features of seen images.
arXiv Detail & Related papers (2021-01-20T05:46:21Z)
- Consensus-Aware Visual-Semantic Embedding for Image-Text Matching [69.34076386926984]
Image-text matching plays a central role in bridging vision and language.
Most existing approaches only rely on the image-text instance pair to learn their representations.
We propose a Consensus-aware Visual-Semantic Embedding model to incorporate the consensus information.
arXiv Detail & Related papers (2020-07-17T10:22:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.