Related papers: Duplex: Dual Prototype Learning for Compositional Zero-Shot Learning


URL: http://arxiv.org/abs/2501.07114v1
Date: Mon, 13 Jan 2025 08:04:32 GMT
Title: Duplex: Dual Prototype Learning for Compositional Zero-Shot Learning
Authors: Zhong Peng, Yishi Xu, Gerong Wang, Wenchao Chen, Bo Chen, Jing Zhang
Abstract summary: Compositional Zero-Shot Learning (CZSL) aims to enable models to recognize novel compositions of visual states and objects that were absent during training. We propose Duplex, a novel dual-prototype learning method that integrates semantic and visual prototypes through a carefully designed dual-branch architecture.
Score: 17.013498508426398
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Compositional Zero-Shot Learning (CZSL) aims to enable models to recognize novel compositions of visual states and objects that were absent during training. Existing methods predominantly focus on learning semantic representations of seen compositions but often fail to disentangle the independent features of states and objects in images, thereby limiting their ability to generalize to unseen compositions. To address this challenge, we propose Duplex, a novel dual-prototype learning method that integrates semantic and visual prototypes through a carefully designed dual-branch architecture, enabling effective representation learning for compositional tasks. Duplex utilizes a Graph Neural Network (GNN) to adaptively update visual prototypes, capturing complex interactions between states and objects. Additionally, it leverages the strong visual-semantic alignment of pre-trained Vision-Language Models (VLMs) and employs a multi-path architecture combined with prompt engineering to align image and text representations, ensuring robust generalization. Extensive experiments on three benchmark datasets demonstrate that Duplex outperforms state-of-the-art methods in both closed-world and open-world settings.

Related papers

Weaving Context Across Images: Improving Vision-Language Models through Focus-Centric Visual Chains [31.828341309787042]
Vision-language models (VLMs) achieve remarkable success in single-image tasks. Real-world scenarios often involve intricate multi-image inputs, leading to a notable performance decline. We propose Focus-Centric Visual Chain, a novel paradigm that enhances VLMs' perception, comprehension, and reasoning abilities in multi-image scenarios.
arXiv Detail & Related papers (2025-04-28T19:02:18Z)
Visual and Semantic Prompt Collaboration for Generalized Zero-Shot Learning [58.73625654718187]
Generalized zero-shot learning aims to recognize both seen and unseen classes with the help of semantic information that is shared among different classes. Existing approaches fine-tune the visual backbone by seen-class data to obtain semantic-related visual features. This paper proposes a novel visual and semantic prompt collaboration framework, which utilizes prompt tuning techniques for efficient feature adaptation.
arXiv Detail & Related papers (2025-03-29T10:17:57Z)
Learning Visual Proxy for Compositional Zero-Shot Learning [15.183106475115583]
We introduce Visual Proxy Learning, a novel approach that facilitates the learning of distinct visual distributions. We propose an effective Cross-Modal Joint Learning strategy that imposes cross-modal constraints between the original text-image space and the fine-grained visual space.
arXiv Detail & Related papers (2025-01-23T17:30:27Z)
Unifying Correspondence, Pose and NeRF for Pose-Free Novel View Synthesis from Stereo Pairs [57.492124844326206]
This work delves into the task of pose-free novel view synthesis from stereo pairs, a challenging and pioneering task in 3D vision. Our innovative framework, unlike any before, seamlessly integrates 2D correspondence matching, camera pose estimation, and NeRF rendering, fostering a synergistic enhancement of these tasks.
arXiv Detail & Related papers (2023-12-12T13:22:44Z)
Synchronizing Vision and Language: Bidirectional Token-Masking AutoEncoder for Referring Image Segmentation [26.262887028563163]
Referring Image Segmentation (RIS) aims to segment target objects expressed in natural language within a scene at the pixel level. We propose a novel bidirectional token-masking autoencoder (BTMAE) inspired by the masked autoencoder (MAE). BTMAE learns the context of image-to-language and language-to-image by reconstructing missing features in both image and language features at the token level.
arXiv Detail & Related papers (2023-11-29T07:33:38Z)
Detecting Any Human-Object Interaction Relationship: Universal HOI Detector with Spatial Prompt Learning on Foundation Models [55.20626448358655]
This study explores universal interaction recognition in an open-world setting through the use of Vision-Language (VL) foundation models and large language models (LLMs). Our design includes an HO Prompt-guided Decoder (HOPD), which facilitates the association of high-level relation representations in the foundation model with various HO pairs within the image. For open-category interaction recognition, our method supports either of two input types: interaction phrase or interpretive sentence.
arXiv Detail & Related papers (2023-11-07T08:27:32Z)
UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC). UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z)
Troika: Multi-Path Cross-Modal Traction for Compositional Zero-Shot Learning [37.445883075993414]
Recent compositional zero-shot learning (CZSL) methods adapt pre-trained vision-language models (VLMs) by constructing trainable prompts only for composed state-object pairs. We propose a novel paradigm for CZSL models that establishes three identification branches (i.e., Multi-Path) to jointly model the state, object, and composition. We conduct extensive experiments on three popular benchmarks, where our method significantly outperforms existing methods in both closed-world and open-world settings.
arXiv Detail & Related papers (2023-03-27T14:10:26Z)
Decomposed Soft Prompt Guided Fusion Enhancing for Compositional Zero-Shot Learning [15.406125901927004]
We propose a novel framework termed Decomposed Fusion with Soft Prompt (DFSP), by involving vision-language models (VLMs) for unseen composition recognition. Specifically, DFSP constructs a vector combination of learnable soft prompts with state and object to establish the joint representation of them. In addition, a cross-modal fusion module is designed between the language and image branches, which decomposes state and object among language features instead of image features.
arXiv Detail & Related papers (2022-11-19T12:29:12Z)
Part-aware Prototypical Graph Network for One-shot Skeleton-based Action Recognition [57.86960990337986]
One-shot skeleton-based action recognition poses unique challenges in learning transferable representation from base classes to novel classes. We propose a part-aware prototypical representation for one-shot skeleton-based action recognition. We demonstrate the effectiveness of our method on two public skeleton-based action recognition datasets.
arXiv Detail & Related papers (2022-08-19T04:54:56Z)
Cross-modal Representation Learning for Zero-shot Action Recognition [67.57406812235767]
We present a cross-modal Transformer-based framework, which jointly encodes video data and text labels for zero-shot action recognition (ZSAR). Our model employs a conceptually new pipeline by which visual representations are learned in conjunction with visual-semantic associations in an end-to-end manner. Experimental results show our model considerably improves upon the state of the art in ZSAR, reaching encouraging top-1 accuracy on UCF101, HMDB51, and ActivityNet benchmark datasets.
arXiv Detail & Related papers (2022-05-03T17:39:27Z)
Bowtie Networks: Generative Modeling for Joint Few-Shot Recognition and Novel-View Synthesis [39.53519330457627]
We propose a novel task of joint few-shot recognition and novel-view synthesis. We aim to simultaneously learn an object classifier and generate images of that type of object from new viewpoints. We focus on the interaction and cooperation between a generative model and a discriminative model.
arXiv Detail & Related papers (2020-08-16T19:40:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.