Dual-Modal Prototype Joint Learning for Compositional Zero-Shot Learning
- URL: http://arxiv.org/abs/2501.13859v1
- Date: Thu, 23 Jan 2025 17:30:27 GMT
- Title: Dual-Modal Prototype Joint Learning for Compositional Zero-Shot Learning
- Authors: Shiyu Zhang, Cheng Yan, Yang Liu, Chenchen Jing, Lei Zhou, Wenjun Wang
- Abstract summary: Compositional Zero-Shot Learning (CZSL) aims to recognize novel compositions of attributes and objects by leveraging knowledge learned from seen compositions.
We propose a novel Dual-Modal Prototype Joint Learning framework for the CZSL task.
- Score: 15.183106475115583
- Abstract: Compositional Zero-Shot Learning (CZSL) aims to recognize novel compositions of attributes and objects by leveraging knowledge learned from seen compositions. Recent approaches have explored the use of Vision-Language Models (VLMs) to align textual and visual modalities. These methods typically employ prompt engineering, parameter-tuning, and modality fusion to generate rich textual prototypes that serve as class prototypes for CZSL. However, the modality gap results in textual prototypes being unable to fully capture the optimal representations of all class prototypes, particularly those with fine-grained features, which can be directly obtained from the visual modality. In this paper, we propose a novel Dual-Modal Prototype Joint Learning framework for the CZSL task. Our approach, based on VLMs, introduces prototypes in both the textual and visual modalities. The textual prototype is optimized to capture broad conceptual information, aiding the model's generalization across unseen compositions. Meanwhile, the visual prototype is used to mitigate the classification errors caused by the modality gap and capture fine-grained details to distinguish images with similar appearances. To effectively optimize these prototypes, we design specialized decomposition modules and a joint learning strategy that enrich the features from both modalities. These prototypes not only capture key category information during training but also serve as crucial reference targets during inference. Experimental results demonstrate that our approach achieves state-of-the-art performance in the closed-world setting and competitive performance in the open-world setting across three publicly available CZSL benchmarks. These findings validate the effectiveness of our method in advancing compositional generalization.
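For intuition, here is a minimal, hedged sketch of how dual-modal prototype classification could look on top of a CLIP-like dual-encoder backbone. Everything in it (the class `DualModalPrototypeClassifier`, the fusion weight `alpha`, the random prototype initialization) is an illustrative assumption and does not reproduce the paper's decomposition modules or joint learning strategy.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualModalPrototypeClassifier(nn.Module):
    """Illustrative sketch: one textual and one visual prototype per class."""

    def __init__(self, num_classes: int, embed_dim: int = 512, alpha: float = 0.5):
        super().__init__()
        # In the paper, textual prototypes come from prompt-based text features;
        # here they are plain learnable vectors purely for illustration.
        self.text_protos = nn.Parameter(torch.randn(num_classes, embed_dim))
        # Visual prototypes are meant to capture fine-grained appearance cues.
        self.visual_protos = nn.Parameter(torch.randn(num_classes, embed_dim))
        self.alpha = alpha  # assumed fusion weight between the two modalities
        self.logit_scale = nn.Parameter(torch.tensor(4.0))

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # Cosine similarity of each image feature against both prototype banks.
        img = F.normalize(image_feats, dim=-1)
        txt = F.normalize(self.text_protos, dim=-1)
        vis = F.normalize(self.visual_protos, dim=-1)
        text_logits = img @ txt.t()      # (B, num_classes)
        visual_logits = img @ vis.t()    # (B, num_classes)
        # Fuse the two similarity scores; inference takes the argmax class.
        fused = self.alpha * text_logits + (1.0 - self.alpha) * visual_logits
        return self.logit_scale.exp() * fused


if __name__ == "__main__":
    model = DualModalPrototypeClassifier(num_classes=10)
    feats = torch.randn(4, 512)              # stand-in for CLIP image features
    print(model(feats).argmax(dim=-1))       # predicted composition indices
```
In the CZSL setting each class index would correspond to an (attribute, object) composition, with textual prototypes initialized from prompt embeddings and visual prototypes refined jointly during training.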
Related papers
- Learning Clustering-based Prototypes for Compositional Zero-shot Learning [56.57299428499455]
ClusPro is a robust clustering-based prototype mining framework for Compositional Zero-Shot Learning.
It defines the conceptual boundaries of primitives through a set of diversified prototypes.
ClusPro efficiently performs prototype clustering in a non-parametric fashion without the introduction of additional learnable parameters.
arXiv Detail & Related papers (2025-02-10T14:20:01Z)
- Duplex: Dual Prototype Learning for Compositional Zero-Shot Learning [17.013498508426398]
Compositional Zero-Shot Learning (CZSL) aims to enable models to recognize novel compositions of visual states and objects that were absent during training.
We propose Duplex, a novel dual-prototype learning method that integrates semantic and visual prototypes through a carefully designed dual-branch architecture.
arXiv Detail & Related papers (2025-01-13T08:04:32Z)
- Toward Modality Gap: Vision Prototype Learning for Weakly-supervised Semantic Segmentation with CLIP [19.697857943845012]
We propose a framework to learn class-specific vision prototypes in vision space with the help of text prototypes.
We also propose a regional semantic contrast module that contrasts regions embedding with corresponding prototypes.
Our proposed framework achieves state-of-the-art performance on two benchmark datasets.
arXiv Detail & Related papers (2024-12-27T13:55:11Z)
- Discriminative Image Generation with Diffusion Models for Zero-Shot Learning [53.44301001173801]
We present DIG-ZSL, a novel Discriminative Image Generation framework for Zero-Shot Learning.
We learn a discriminative class token (DCT) for each unseen class under the guidance of a pre-trained category discrimination model (CDM).
In this paper, the extensive experiments and visualizations on four datasets show that our DIG-ZSL: (1) generates diverse and high-quality images, (2) outperforms previous state-of-the-art non-human-annotated semantic prototype-based methods by a large margin, and (3) achieves comparable or better performance than baselines that leverage human-annotated semantic prototypes.
arXiv Detail & Related papers (2024-12-23T02:18:54Z)
- Envisioning Class Entity Reasoning by Large Language Models for Few-shot Learning [13.68867780184022]
Few-shot learning aims to recognize new concepts using a limited number of visual samples.
Our framework incorporates both the abstract class semantics and the concrete class entities extracted from Large Language Models (LLMs).
For the challenging one-shot setting, our approach, utilizing the ResNet-12 backbone, achieves an average improvement of 1.95% over the second-best competitor.
arXiv Detail & Related papers (2024-08-22T15:10:20Z)
- Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction [61.16125290912494]
$\text{EVL}_\text{Gen}$ is a framework designed for the pre-training of visually conditioned language generation models.
We show that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance.
arXiv Detail & Related papers (2023-10-05T03:40:06Z)
- Multi-Modal Prototypes for Open-World Semantic Segmentation [37.84805778548119]
We propose to encompass textual and visual clues as multi-modal prototypes to allow more comprehensive support for semantic segmentation.
We decompose the high-level language information into multi-aspect prototypes and aggregate the low-level visual information into more semantic prototypes.
Based on an elastic mask prediction module, we are able to solve the zero-shot, few-shot and generalized counterpart tasks in one architecture.
arXiv Detail & Related papers (2023-07-05T03:27:31Z)
- Prompting Language-Informed Distribution for Compositional Zero-Shot Learning [73.49852821602057]
The compositional zero-shot learning (CZSL) task aims to recognize unseen compositional visual concepts.
We propose a model that prompts the language-informed distribution, termed PLID, for the task.
Experimental results on the MIT-States, UT-Zappos, and C-GQA datasets show that PLID outperforms prior art.
arXiv Detail & Related papers (2023-05-23T18:00:22Z)
- Dual Prototypical Contrastive Learning for Few-shot Semantic Segmentation [55.339405417090084]
We propose a dual prototypical contrastive learning approach tailored to the few-shot semantic segmentation (FSS) task.
The main idea is to make the prototypes more discriminative by increasing inter-class distance while reducing intra-class distance in the prototype feature space.
We demonstrate that the proposed dual contrastive learning approach outperforms state-of-the-art FSS methods on the PASCAL-5$^i$ and COCO-20$^i$ datasets.
arXiv Detail & Related papers (2021-11-09T08:14:50Z)
- Partner-Assisted Learning for Few-Shot Image Classification [54.66864961784989]
Few-shot learning has been studied to mimic human visual capabilities and learn effective models without the need for exhaustive human annotation.
In this paper, we focus on the design of training strategy to obtain an elemental representation such that the prototype of each novel class can be estimated from a few labeled samples.
We propose a two-stage training scheme, which first trains a partner encoder to model pair-wise similarities and extract features serving as soft-anchors, and then trains a main encoder by aligning its outputs with soft-anchors while attempting to maximize classification performance.
arXiv Detail & Related papers (2021-09-15T22:46:19Z)