CAILA: Concept-Aware Intra-Layer Adapters for Compositional Zero-Shot
Learning
- URL: http://arxiv.org/abs/2305.16681v2
- Date: Wed, 8 Nov 2023 02:08:56 GMT
- Title: CAILA: Concept-Aware Intra-Layer Adapters for Compositional Zero-Shot
Learning
- Authors: Zhaoheng Zheng, Haidong Zhu and Ram Nevatia
- Abstract summary: We study the problem of Compositional Zero-Shot Learning (CZSL), which aims to recognize novel attribute-object combinations built from pre-existing concepts.
We propose to insert adapters, a parameter-efficient technique proven effective for large language models, into each CLIP encoder layer.
We further equip the adapters with concept awareness so that concept-specific features for "object", "attribute", and "composition" can be extracted.
- Score: 14.496173899477283
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we study the problem of Compositional Zero-Shot Learning
(CZSL), which aims to recognize novel attribute-object combinations built from
pre-existing concepts. Recent work focuses on applying large-scale
Vision-Language Pre-trained (VLP) models such as CLIP, which have strong
generalization ability. However, these methods treat the pre-trained model as a
black box and focus on pre- and post-CLIP operations, so they do not inherently
mine the semantic concepts between the layers inside CLIP. We propose to dive
deep into the architecture and insert adapters, a parameter-efficient technique
proven effective for large language models, into each CLIP encoder layer. We
further equip the adapters with concept awareness so that concept-specific
features for "object", "attribute", and "composition" can be extracted. We
assess our method on four popular CZSL datasets, MIT-States, C-GQA, UT-Zappos,
and VAW-CZSL, and it achieves state-of-the-art performance compared to existing
methods on all of them.
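The abstract does not spell out implementation details, so the following is a minimal PyTorch sketch of the general idea: a small bottleneck adapter per concept ("object", "attribute", "composition") attached to each layer of a frozen CLIP encoder. The class name ConceptAwareAdapter, the bottleneck width, and the way a concept branch is selected are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch of a concept-aware bottleneck adapter (illustrative assumptions,
# not the paper's exact design).
import torch
import torch.nn as nn


class ConceptAwareAdapter(nn.Module):
    """One small bottleneck adapter per concept, applied with a residual connection."""

    def __init__(self, dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.branches = nn.ModuleDict({
            concept: nn.Sequential(
                nn.Linear(dim, bottleneck),
                nn.GELU(),
                nn.Linear(bottleneck, dim),
            )
            for concept in ("object", "attribute", "composition")
        })

    def forward(self, x: torch.Tensor, concept: str) -> torch.Tensor:
        # The residual keeps the frozen CLIP features intact; only the small
        # concept-specific branch is trained, which is what makes the approach
        # parameter-efficient.
        return x + self.branches[concept](x)


def build_layer_adapters(num_layers: int, dim: int = 768) -> nn.ModuleList:
    # One adapter per (frozen) CLIP encoder layer; only these new parameters train.
    return nn.ModuleList([ConceptAwareAdapter(dim) for _ in range(num_layers)])
```

In such a setup, the CLIP backbone would stay frozen and only the adapter parameters update, which matches the parameter-efficient framing in the abstract.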
Related papers
- Concept Visualization: Explaining the CLIP Multi-modal Embedding Using WordNet [4.597864989500202]
We propose a novel saliency methodology that explains the CLIP embedding of an image by exploiting the multi-modal nature of the embeddings.
ConVis makes use of lexical information from WordNet to compute task-agnostic Saliency Maps for any concept, not limited to concepts the end model was trained on.
arXiv Detail & Related papers (2024-05-23T13:41:17Z)
- Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models [53.48409081555687]
In this work, we explore such large pre-trained models to obtain features, i.e. CLIP for visual features, and CLAP for audio features.
We propose a simple yet effective model that only relies on feed-forward neural networks.
Our framework achieves state-of-the-art performance on VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL.
arXiv Detail & Related papers (2024-04-09T13:39:37Z)
- CLIP Can Understand Depth [5.6138460823631835]
We adapt CLIP to produce meaningful-quality monocular depth estimation with dense prediction.
Our model exhibits impressive performance, matching several previous state-of-the-art vision-only models.
arXiv Detail & Related papers (2024-02-05T18:09:33Z)
- Concept-Guided Prompt Learning for Generalization in Vision-Language Models [33.361744437967126]
We propose Concept-Guided Prompt Learning for vision-language models.
We leverage the well-learned knowledge of Contrastive Language-Image Pretraining to create a visual concept cache.
In order to refine the text features, we develop a projector that transforms multi-level visual features into text features.
arXiv Detail & Related papers (2024-01-15T04:04:47Z)
- Enhancing Few-shot CLIP with Semantic-Aware Fine-Tuning [61.902254546858465]
Methods based on Contrastive Language-Image Pre-training have exhibited promising performance in few-shot adaptation tasks.
We propose fine-tuning the parameters of the attention pooling layer during the training process to encourage the model to focus on task-specific semantics.
arXiv Detail & Related papers (2023-11-08T05:18:57Z)
- Class Incremental Learning with Pre-trained Vision-Language Models [59.15538370859431]
We propose an approach to exploiting pre-trained vision-language models (e.g. CLIP) that enables further adaptation.
Experiments on several conventional benchmarks consistently show a significant margin of improvement over the current state-of-the-art.
arXiv Detail & Related papers (2023-10-31T10:45:03Z)
- Retrieval-Enhanced Contrastive Vision-Text Models [61.783728119255365]
We propose to equip vision-text models with the ability to refine their embedding with cross-modal retrieved information from a memory at inference time.
Remarkably, we show that this can be done with a light-weight, single-layer fusion transformer on top of a frozen CLIP (see the sketch after this list).
Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks.
arXiv Detail & Related papers (2023-06-12T15:52:02Z)
- Prompting Language-Informed Distribution for Compositional Zero-Shot Learning [73.49852821602057]
The compositional zero-shot learning (CZSL) task aims to recognize unseen compositional visual concepts.
We propose a model that prompts the language-informed distribution, a.k.a. PLID, for the task.
Experimental results on the MIT-States, UT-Zappos, and C-GQA datasets show the superior performance of PLID over the prior arts.
arXiv Detail & Related papers (2023-05-23T18:00:22Z)
- Global Knowledge Calibration for Fast Open-Vocabulary Segmentation [124.74256749281625]
We introduce a text diversification strategy that generates a set of synonyms for each training category.
We also employ a text-guided knowledge distillation method to preserve the generalizable knowledge of CLIP.
Our proposed model achieves robust generalization performance across various datasets.
arXiv Detail & Related papers (2023-03-16T09:51:41Z)
- StyLIP: Multi-Scale Style-Conditioned Prompt Learning for CLIP-based Domain Generalization [26.08922351077744]
StyLIP is a novel approach for Domain Generalization that enhances CLIP's classification performance across domains.
Our method focuses on a domain-agnostic prompt learning strategy, aiming to disentangle the visual style and content information embedded in CLIP's pre-trained vision encoder.
arXiv Detail & Related papers (2023-02-18T07:36:16Z)
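For the Retrieval-Enhanced Contrastive Vision-Text Models (RECO) entry above, here is a minimal sketch of the fusion idea described there: a single transformer encoder layer refines a frozen CLIP embedding using cross-modal embeddings retrieved from a memory. The module name RetrievalFusion, the dimensions, and the use of the refined query token are assumptions for illustration, not the paper's exact recipe.

```python
# Minimal sketch of retrieval-enhanced fusion over frozen CLIP embeddings, in the
# spirit of the RECO entry above (illustrative assumptions, not the exact recipe).
import torch
import torch.nn as nn
import torch.nn.functional as F


class RetrievalFusion(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        # A single transformer encoder layer fuses the query embedding with the
        # retrieved cross-modal embeddings; the CLIP backbone itself stays frozen.
        self.fusion = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)

    def forward(self, query: torch.Tensor, retrieved: torch.Tensor) -> torch.Tensor:
        # query:     (B, dim)    frozen CLIP embedding of the input image or text
        # retrieved: (B, K, dim) cross-modal embeddings retrieved from a memory
        tokens = torch.cat([query.unsqueeze(1), retrieved], dim=1)  # (B, 1 + K, dim)
        fused = self.fusion(tokens)
        # Keep the refined query token and renormalize it for contrastive scoring.
        return F.normalize(fused[:, 0], dim=-1)
```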