Decomposed Soft Prompt Guided Fusion Enhancing for Compositional
Zero-Shot Learning
- URL: http://arxiv.org/abs/2211.10681v1
- Date: Sat, 19 Nov 2022 12:29:12 GMT
- Title: Decomposed Soft Prompt Guided Fusion Enhancing for Compositional
Zero-Shot Learning
- Authors: Xiaocheng Lu, Ziming Liu, Song Guo, Jingcai Guo
- Abstract summary: We propose a novel framework termed Decomposed Fusion with Soft Prompt (DFSP), which involves vision-language models (VLMs) for unseen composition recognition.
Specifically, DFSP constructs a vector combination of learnable soft prompts with state and object to establish their joint representation.
In addition, a cross-modal decomposed fusion module is designed between the language and image branches, which decomposes state and object among language features rather than image features.
- Score: 15.406125901927004
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Compositional Zero-Shot Learning (CZSL) aims to recognize novel concepts
formed by states and objects that are known during training. Existing methods either
learn the combined state-object representation, which hinders generalization to unseen
compositions, or design two classifiers that identify the state and the object
separately from image features, which ignores the intrinsic relationship between them.
To jointly address these issues and build a more robust CZSL system, we propose a novel
framework termed Decomposed Fusion with Soft Prompt (DFSP), which involves
vision-language models (VLMs) for unseen composition recognition. Specifically, DFSP
constructs a vector combination of learnable soft prompts with state and object to
establish their joint representation. In addition, a cross-modal decomposed fusion
module is designed between the language and image branches, which decomposes state and
object among the language features instead of the image features. Notably, once fused
with these decomposed features, the image features become more expressive for learning
the relationships with states and objects respectively, which improves the response to
unseen compositions in the pair space and thus narrows the domain gap between the seen
and unseen sets. Experimental results on three challenging benchmarks demonstrate that
our approach outperforms other state-of-the-art methods by large margins.
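The abstract describes two concrete components: (i) a joint composition prompt built from learnable soft-prompt vectors combined with state and object embeddings, and (ii) a cross-modal decomposed fusion module in which image features attend to the decomposed state/object language features. The sketch below illustrates both ideas in PyTorch; it is not the authors' released code, and the class names, feature dimension (512), context length, and the use of nn.MultiheadAttention for the fusion step are all illustrative assumptions.

```python
# Minimal sketch of the two DFSP ideas described in the abstract, NOT the
# authors' implementation: (1) a learnable soft prompt combined with state and
# object embeddings forms one joint composition prompt, and (2) a cross-modal
# fusion step lets image tokens attend to the decomposed (state / object)
# language features. All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class SoftCompositionPrompt(nn.Module):
    """Learnable context vectors + state/object embeddings per candidate pair."""

    def __init__(self, num_states, num_objects, dim=512, ctx_len=3):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(ctx_len, dim) * 0.02)  # soft prompt
        self.state_emb = nn.Embedding(num_states, dim)
        self.object_emb = nn.Embedding(num_objects, dim)

    def forward(self, pairs):
        # pairs: (P, 2) long tensor of (state_id, object_id) for each candidate pair
        s = self.state_emb(pairs[:, 0])                       # (P, dim)
        o = self.object_emb(pairs[:, 1])                      # (P, dim)
        ctx = self.ctx.unsqueeze(0).expand(pairs.size(0), -1, -1)
        # joint prompt per composition: [ctx_1 ... ctx_k, state, object]
        return torch.cat([ctx, s.unsqueeze(1), o.unsqueeze(1)], dim=1)  # (P, k+2, dim)


class DecomposedFusion(nn.Module):
    """Image tokens attend to decomposed (state, object) language features."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, state_feats, object_feats):
        # img_tokens: (B, N, dim); state_feats: (S, dim); object_feats: (O, dim)
        lang = torch.cat([state_feats, object_feats], dim=0)   # decomposed language side
        lang = lang.unsqueeze(0).expand(img_tokens.size(0), -1, -1)
        fused, _ = self.attn(img_tokens, lang, lang)            # cross-attention
        return self.norm(img_tokens + fused)                    # residual fusion


if __name__ == "__main__":
    # toy shapes: 10 states, 8 objects, a frozen-VLM-like feature dim of 512
    prompts = SoftCompositionPrompt(num_states=10, num_objects=8)
    pairs = torch.tensor([[0, 1], [3, 5], [9, 7]])               # 3 candidate pairs
    print(prompts(pairs).shape)                                  # (3, 5, 512)

    fusion = DecomposedFusion()
    img = torch.randn(2, 49, 512)                                # 2 images, 7x7 patches
    out = fusion(img, torch.randn(10, 512), torch.randn(8, 512))
    print(out.shape)                                             # (2, 49, 512)
```

In a full CLIP-style pipeline, the joint prompts would be passed through a frozen text encoder, and the fused image features would then be matched against every candidate state-object pair in the pair space.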
Related papers
- BIFRÖST: 3D-Aware Image Compositing with Language Instructions [27.484947109237964]
Bifröst is a novel 3D-aware framework built upon diffusion models to perform instruction-based image composition.
Bifröst addresses these issues by training an MLLM as a 2.5D location predictor and integrating depth maps as an extra condition during the generation process.
arXiv Detail & Related papers (2024-10-24T18:35:12Z)
- ComAlign: Compositional Alignment in Vision-Language Models [2.3250871476216814]
We introduce Compositional Alignment (ComAlign) to discover more exact correspondence of text and image components.
Our methodology emphasizes that the compositional structure extracted from the text modality must also be retained in the image modality.
We train a lightweight network lying on top of existing visual and language encoders using a small dataset.
arXiv Detail & Related papers (2024-09-12T16:46:41Z)
- Cross-composition Feature Disentanglement for Compositional Zero-shot Learning [49.919635694894204]
Disentanglement of visual features of primitives (i.e., attributes and objects) has shown exceptional results in Compositional Zero-shot Learning (CZSL).
We propose the solution of cross-composition feature disentanglement, which takes multiple primitive-sharing compositions as inputs and constrains the disentangled primitive features to be general across these compositions.
arXiv Detail & Related papers (2024-08-19T08:23:09Z)
- Synchronizing Vision and Language: Bidirectional Token-Masking AutoEncoder for Referring Image Segmentation [26.262887028563163]
Referring Image Segmentation (RIS) aims to segment target objects expressed in natural language within a scene at the pixel level.
We propose a novel bidirectional token-masking autoencoder (BTMAE) inspired by the masked autoencoder (MAE).
BTMAE learns the context of image-to-language and language-to-image by reconstructing missing features in both image and language features at the token level.
arXiv Detail & Related papers (2023-11-29T07:33:38Z)
- Edge Guided GANs with Multi-Scale Contrastive Learning for Semantic Image Synthesis [139.2216271759332]
We propose a novel ECGAN for the challenging semantic image synthesis task.
The semantic labels do not provide detailed structural information, making it challenging to synthesize local details and structures.
The widely adopted CNN operations such as convolution, down-sampling, and normalization usually cause spatial resolution loss.
We propose a novel contrastive learning method, which aims to enforce pixel embeddings belonging to the same semantic class to generate more similar image content.
arXiv Detail & Related papers (2023-07-22T14:17:19Z)
- Troika: Multi-Path Cross-Modal Traction for Compositional Zero-Shot Learning [37.445883075993414]
Recent compositional zero-shot learning (CZSL) methods adapt pre-trained vision-language models (VLMs) by constructing trainable prompts only for composed state-object pairs.
We propose a novel paradigm for CZSL models that establishes three identification branches (i.e., Multi-Path) to jointly model the state, object, and composition.
We conduct extensive experiments on three popular benchmarks, where our method significantly outperforms existing methods in both closed-world and open-world settings.
arXiv Detail & Related papers (2023-03-27T14:10:26Z)
- Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features.
Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment.
arXiv Detail & Related papers (2022-12-27T09:13:19Z)
- Siamese Contrastive Embedding Network for Compositional Zero-Shot Learning [76.13542095170911]
Compositional Zero-Shot Learning (CZSL) aims to recognize unseen compositions formed from states and objects seen during training.
We propose a novel Siamese Contrastive Embedding Network (SCEN) for unseen composition recognition.
Our method significantly outperforms the state-of-the-art approaches on three challenging benchmark datasets.
arXiv Detail & Related papers (2022-06-29T09:02:35Z)
- Learning Contrastive Representation for Semantic Correspondence [150.29135856909477]
We propose a multi-level contrastive learning approach for semantic matching.
We show that image-level contrastive learning is a key component to encourage the convolutional features to find correspondence between similar objects.
arXiv Detail & Related papers (2021-09-22T18:34:14Z)
- Semantic Disentangling Generalized Zero-Shot Learning [50.259058462272435]
Generalized Zero-Shot Learning (GZSL) aims to recognize images from both seen and unseen categories.
In this paper, we propose a novel feature disentangling approach based on an encoder-decoder architecture.
The proposed model aims to distill quality semantic-consistent representations that capture intrinsic features of seen images.
arXiv Detail & Related papers (2021-01-20T05:46:21Z)