Related papers: Decomposed Soft Prompt Guided Fusion Enhancing for Compositional Zero-Shot Learning

Decomposed Soft Prompt Guided Fusion Enhancing for Compositional Zero-Shot Learning

URL: http://arxiv.org/abs/2211.10681v1
Date: Sat, 19 Nov 2022 12:29:12 GMT
Title: Decomposed Soft Prompt Guided Fusion Enhancing for Compositional Zero-Shot Learning
Authors: Xiaocheng Lu, Ziming Liu, Song Guo, Jingcai Guo
Abstract summary: We propose a novel framework termed Decomposed Fusion with Soft Prompt (DFSP)1, by involving vision-language models (VLMs) for unseen composition recognition. Specifically, DFSP constructs a vector combination of learnable soft prompts with state and object to establish the joint representation of them. In addition, a cross-modal fusion module is designed between the language and image branches, which decomposes state and object among language features instead of image features.
Score: 15.406125901927004
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Compositional Zero-Shot Learning (CZSL) aims to recognize novel concepts formed by known states and objects during training. Existing methods either learn the combined state-object representation, challenging the generalization of unseen compositions, or design two classifiers to identify state and object separately from image features, ignoring the intrinsic relationship between them. To jointly eliminate the above issues and construct a more robust CZSL system, we propose a novel framework termed Decomposed Fusion with Soft Prompt (DFSP)1, by involving vision-language models (VLMs) for unseen composition recognition. Specifically, DFSP constructs a vector combination of learnable soft prompts with state and object to establish the joint representation of them. In addition, a cross-modal decomposed fusion module is designed between the language and image branches, which decomposes state and object among language features instead of image features. Notably, being fused with the decomposed features, the image features can be more expressive for learning the relationship with states and objects, respectively, to improve the response of unseen compositions in the pair space, hence narrowing the domain gap between seen and unseen sets. Experimental results on three challenging benchmarks demonstrate that our approach significantly outperforms other state-of-the-art methods by large margins.

Related papers

Object-centric Binding in Contrastive Language-Image Pretraining [9.376583779399834]
We propose a novel approach that diverges from commonly used strategies, which rely on the design of hard-negative augmentations. Our work focuses on integrating inductive biases into pre-trained CLIP-like models to improve their compositional understanding without using any additional hard-negatives. Our resulting model paves the way towards more accurate and sample-efficient image-text matching of complex scenes.
arXiv Detail & Related papers (2025-02-19T21:30:51Z)
Duplex: Dual Prototype Learning for Compositional Zero-Shot Learning [17.013498508426398]
Compositional Zero-Shot Learning (CZSL) aims to enable models to recognize novel compositions of visual states and objects that were absent during training. We propose Duplex, a novel dual-prototype learning method that integrates semantic and visual prototypes through a carefully designed dual-branch architecture.
arXiv Detail & Related papers (2025-01-13T08:04:32Z)
Leveraging MLLM Embeddings and Attribute Smoothing for Compositional Zero-Shot Learning [21.488599805772054]
Compositional zero-shot learning aims to recognize novel compositions of attributes and objects learned from seen compositions. Previous works disentangle attribute and object by extracting shared and exclusive parts between image pairs sharing the same attribute (object) We propose a novel framework named Multimodal Large Language Model (MLLM) embeddings and attribute smoothing guided disentanglement (TRIDENT) for CZSL.
arXiv Detail & Related papers (2024-11-18T07:55:54Z)
ComAlign: Compositional Alignment in Vision-Language Models [2.3250871476216814]
We introduce Compositional Alignment (ComAlign) to discover more exact correspondence of text and image components. Our methodology emphasizes that the compositional structure extracted from the text modality must also be retained in the image modality. We train a lightweight network lying on top of existing visual and language encoders using a small dataset.
arXiv Detail & Related papers (2024-09-12T16:46:41Z)
Cross-composition Feature Disentanglement for Compositional Zero-shot Learning [49.919635694894204]
Disentanglement of visual features of primitives (i.e., attributes and objects) has shown exceptional results in Compositional Zero-shot Learning (CZSL) We propose the solution of cross-composition feature disentanglement, which takes multiple primitive-sharing compositions as inputs and constrains the disentangled primitive features to be general across these compositions.
arXiv Detail & Related papers (2024-08-19T08:23:09Z)
Synchronizing Vision and Language: Bidirectional Token-Masking AutoEncoder for Referring Image Segmentation [26.262887028563163]
Referring Image (RIS) aims to segment target objects expressed in natural language within a scene at the pixel level. We propose a novel bidirectional token-masking autoencoder (BTMAE) inspired by the masked autoencoder (MAE) BTMAE learns the context of image-to-language and language-to-image by reconstructing missing features in both image and language features at the token level.
arXiv Detail & Related papers (2023-11-29T07:33:38Z)
Edge Guided GANs with Multi-Scale Contrastive Learning for Semantic Image Synthesis [139.2216271759332]
We propose a novel ECGAN for the challenging semantic image synthesis task. The semantic labels do not provide detailed structural information, making it challenging to synthesize local details and structures. The widely adopted CNN operations such as convolution, down-sampling, and normalization usually cause spatial resolution loss. We propose a novel contrastive learning method, which aims to enforce pixel embeddings belonging to the same semantic class to generate more similar image content.
arXiv Detail & Related papers (2023-07-22T14:17:19Z)
Troika: Multi-Path Cross-Modal Traction for Compositional Zero-Shot Learning [37.445883075993414]
Recent compositional zero-shot learning (CZSL) methods adapt pre-trained vision-language models (VLMs) by constructing trainable prompts only for composed state-object pairs. We propose a novel paradigm for CZSL models that establishes three identification branches (i.e., Multi-Path) to jointly model the state, object, and composition. We conduct extensive experiments on three popular benchmarks, where our method significantly outperforms existing methods in both closed-world and open-world settings.
arXiv Detail & Related papers (2023-03-27T14:10:26Z)
Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features. Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment.
arXiv Detail & Related papers (2022-12-27T09:13:19Z)
Siamese Contrastive Embedding Network for Compositional Zero-Shot Learning [76.13542095170911]
Compositional Zero-Shot Learning (CZSL) aims to recognize unseen compositions formed from seen state and object during training. We propose a novel Siamese Contrastive Embedding Network (SCEN) for unseen composition recognition. Our method significantly outperforms the state-of-the-art approaches on three challenging benchmark datasets.
arXiv Detail & Related papers (2022-06-29T09:02:35Z)
Learning Contrastive Representation for Semantic Correspondence [150.29135856909477]
We propose a multi-level contrastive learning approach for semantic matching. We show that image-level contrastive learning is a key component to encourage the convolutional features to find correspondence between similar objects.
arXiv Detail & Related papers (2021-09-22T18:34:14Z)
Semantic Disentangling Generalized Zero-Shot Learning [50.259058462272435]
Generalized Zero-Shot Learning (GZSL) aims to recognize images from both seen and unseen categories. In this paper, we propose a novel feature disentangling approach based on an encoder-decoder architecture. The proposed model aims to distill quality semantic-consistent representations that capture intrinsic features of seen images.
arXiv Detail & Related papers (2021-01-20T05:46:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.