CAILA: Concept-Aware Intra-Layer Adapters for Compositional Zero-Shot
Learning
- URL: http://arxiv.org/abs/2305.16681v2
- Date: Wed, 8 Nov 2023 02:08:56 GMT
- Title: CAILA: Concept-Aware Intra-Layer Adapters for Compositional Zero-Shot
Learning
- Authors: Zhaoheng Zheng, Haidong Zhu and Ram Nevatia
- Abstract summary: We study the problem of Compositional Zero-Shot Learning (CZSL), which aims to recognize novel attribute-object combinations built from pre-existing concepts.
We propose to insert adapters, a parameter-efficient technique proven effective for large language models, into each CLIP encoder layer.
We further equip the adapters with concept awareness so that concept-specific features for "object", "attribute", and "composition" can be extracted.
- Score: 14.496173899477283
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we study the problem of Compositional Zero-Shot Learning
(CZSL), which aims to recognize novel attribute-object combinations built from
pre-existing concepts. Recent work focuses on applying large-scale
Vision-Language Pre-trained (VLP) models such as CLIP, which have strong
generalization ability. However, these methods treat the pre-trained model as a
black box and focus on pre- and post-CLIP operations, so they do not inherently
mine the semantic concepts between the layers inside CLIP. We propose to dive
deep into the architecture and insert adapters, a parameter-efficient technique
proven effective for large language models, into each CLIP encoder layer. We
further equip the adapters with concept awareness so that concept-specific
features for "object", "attribute", and "composition" can be extracted. We
assess our method on four popular CZSL datasets, MIT-States, C-GQA, UT-Zappos,
and VAW-CZSL, and it achieves state-of-the-art performance compared to existing
methods on all of them.
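The abstract does not spell out implementation details, so the following is a minimal PyTorch sketch of the general idea: a small bottleneck adapter per concept ("object", "attribute", "composition") attached to each layer of a frozen CLIP encoder. The class name ConceptAwareAdapter, the bottleneck width, and the way a concept branch is selected are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch of a concept-aware bottleneck adapter (illustrative assumptions,
# not the paper's exact design).
import torch
import torch.nn as nn


class ConceptAwareAdapter(nn.Module):
    """One small bottleneck adapter per concept, applied with a residual connection."""

    def __init__(self, dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.branches = nn.ModuleDict({
            concept: nn.Sequential(
                nn.Linear(dim, bottleneck),
                nn.GELU(),
                nn.Linear(bottleneck, dim),
            )
            for concept in ("object", "attribute", "composition")
        })

    def forward(self, x: torch.Tensor, concept: str) -> torch.Tensor:
        # The residual keeps the frozen CLIP features intact; only the small
        # concept-specific branch is trained, which is what makes the approach
        # parameter-efficient.
        return x + self.branches[concept](x)


def build_layer_adapters(num_layers: int, dim: int = 768) -> nn.ModuleList:
    # One adapter per (frozen) CLIP encoder layer; only these new parameters train.
    return nn.ModuleList([ConceptAwareAdapter(dim) for _ in range(num_layers)])
```

In such a setup, the CLIP backbone would stay frozen and only the adapter parameters update, which matches the parameter-efficient framing in the abstract.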
Related papers
- Concept Visualization: Explaining the CLIP Multi-modal Embedding Using WordNet [4.597864989500202]
We propose a novel saliency methodology that explains the CLIP embedding of an image by exploiting the multi-modal nature of the embeddings.
ConVis makes use of lexical information from WordNet to compute task-agnostic Saliency Maps for any concept, not limited to concepts the end model was trained on.
arXiv Detail & Related papers (2024-05-23T13:41:17Z)
- Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models [53.48409081555687]
In this work, we explore such large pre-trained models to obtain features, i.e. CLIP for visual features, and CLAP for audio features.
We propose a simple yet effective model that only relies on feed-forward neural networks.
Our framework achieves state-of-the-art performance on VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL.
arXiv Detail & Related papers (2024-04-09T13:39:37Z)
- CLIP Can Understand Depth [5.6138460823631835]
We adapt CLIP to produce meaningful-quality monocular depth estimation with dense prediction.
Our model exhibits impressive performance, matching several previous state-of-the-art vision-only models.
arXiv Detail & Related papers (2024-02-05T18:09:33Z)
- Concept-Guided Prompt Learning for Generalization in Vision-Language Models [33.361744437967126]
We propose Concept-Guided Prompt Learning for vision-language models.
We leverage the well-learned knowledge of Contrastive Language-Image Pretraining to create a visual concept cache.
In order to refine the text features, we develop a projector that transforms multi-level visual features into text features.
arXiv Detail & Related papers (2024-01-15T04:04:47Z)
- Enhancing Few-shot CLIP with Semantic-Aware Fine-Tuning [61.902254546858465]
Methods based on Contrastive Language-Image Pre-training have exhibited promising performance in few-shot adaptation tasks.
We propose fine-tuning the parameters of the attention pooling layer during the training process to encourage the model to focus on task-specific semantics.
arXiv Detail & Related papers (2023-11-08T05:18:57Z)
- Class Incremental Learning with Pre-trained Vision-Language Models [59.15538370859431]
We propose an approach to exploiting pre-trained vision-language models (e.g. CLIP) that enables further adaptation.
Experiments on several conventional benchmarks consistently show a significant margin of improvement over the current state-of-the-art.
arXiv Detail & Related papers (2023-10-31T10:45:03Z)
- Retrieval-Enhanced Contrastive Vision-Text Models [61.783728119255365]
We propose to equip vision-text models with the ability to refine their embedding with cross-modal retrieved information from a memory at inference time.
Remarkably, we show that this can be done with a light-weight, single-layer fusion transformer on top of a frozen CLIP (see the sketch after this list).
Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks.
arXiv Detail & Related papers (2023-06-12T15:52:02Z)
- Prompting Language-Informed Distribution for Compositional Zero-Shot Learning [73.49852821602057]
The compositional zero-shot learning (CZSL) task aims to recognize unseen compositional visual concepts.
We propose a model that prompts the language-informed distribution, a.k.a. PLID, for the task.
Experimental results on the MIT-States, UT-Zappos, and C-GQA datasets show the superior performance of PLID over the prior arts.
arXiv Detail & Related papers (2023-05-23T18:00:22Z)
- Global Knowledge Calibration for Fast Open-Vocabulary Segmentation [124.74256749281625]
We introduce a text diversification strategy that generates a set of synonyms for each training category.
We also employ a text-guided knowledge distillation method to preserve the generalizable knowledge of CLIP.
Our proposed model achieves robust generalization performance across various datasets.
arXiv Detail & Related papers (2023-03-16T09:51:41Z)
- StyLIP: Multi-Scale Style-Conditioned Prompt Learning for CLIP-based Domain Generalization [26.08922351077744]
StyLIP is a novel approach for Domain Generalization that enhances CLIP's classification performance across domains.
Our method focuses on a domain-agnostic prompt learning strategy, aiming to disentangle the visual style and content information embedded in CLIP's pre-trained vision encoder.
arXiv Detail & Related papers (2023-02-18T07:36:16Z)
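For the Retrieval-Enhanced Contrastive Vision-Text Models (RECO) entry above, here is a minimal sketch of the fusion idea described there: a single transformer encoder layer refines a frozen CLIP embedding using cross-modal embeddings retrieved from a memory. The module name RetrievalFusion, the dimensions, and the use of the refined query token are assumptions for illustration, not the paper's exact recipe.

```python
# Minimal sketch of retrieval-enhanced fusion over frozen CLIP embeddings, in the
# spirit of the RECO entry above (illustrative assumptions, not the exact recipe).
import torch
import torch.nn as nn
import torch.nn.functional as F


class RetrievalFusion(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        # A single transformer encoder layer fuses the query embedding with the
        # retrieved cross-modal embeddings; the CLIP backbone itself stays frozen.
        self.fusion = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)

    def forward(self, query: torch.Tensor, retrieved: torch.Tensor) -> torch.Tensor:
        # query:     (B, dim)    frozen CLIP embedding of the input image or text
        # retrieved: (B, K, dim) cross-modal embeddings retrieved from a memory
        tokens = torch.cat([query.unsqueeze(1), retrieved], dim=1)  # (B, 1 + K, dim)
        fused = self.fusion(tokens)
        # Keep the refined query token and renormalize it for contrastive scoring.
        return F.normalize(fused[:, 0], dim=-1)
```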