Learning Semantic-Aware Representation in Visual-Language Models for Multi-Label Recognition with Partial Labels
- URL: http://arxiv.org/abs/2412.10843v1
- Date: Sat, 14 Dec 2024 14:31:36 GMT
- Title: Learning Semantic-Aware Representation in Visual-Language Models for Multi-Label Recognition with Partial Labels
- Authors: Haoxian Ruan, Zhihua Xu, Zhijing Yang, Yongyi Lu, Jinghui Qin, Tianshui Chen,
- Abstract summary: Multi-label recognition with partial labels (MLR-PL) is a practical task in computer vision.
We introduce a semantic decoupling module and a category-specific prompt optimization method in CLIP-based framework.
Our method effectively separates information from different categories and achieves better performance compared to CLIP-based baseline method.
- Score: 19.740929527669483
- License:
- Abstract: Multi-label recognition with partial labels (MLR-PL), in which only some labels are known while others are unknown for each image, is a practical task in computer vision, since collecting large-scale and complete multi-label datasets is difficult in real application scenarios. Recently, vision language models (e.g. CLIP) have demonstrated impressive transferability to downstream tasks in data limited or label limited settings. However, current CLIP-based methods suffer from semantic confusion in MLR task due to the lack of fine-grained information in the single global visual and textual representation for all categories. In this work, we address this problem by introducing a semantic decoupling module and a category-specific prompt optimization method in CLIP-based framework. Specifically, the semantic decoupling module following the visual encoder learns category-specific feature maps by utilizing the semantic-guided spatial attention mechanism. Moreover, the category-specific prompt optimization method is introduced to learn text representations aligned with category semantics. Therefore, the prediction of each category is independent, which alleviate the semantic confusion problem. Extensive experiments on Microsoft COCO 2014 and Pascal VOC 2007 datasets demonstrate that the proposed framework significantly outperforms current state-of-art methods with a simpler model structure. Additionally, visual analysis shows that our method effectively separates information from different categories and achieves better performance compared to CLIP-based baseline method.
Related papers
- Category-Adaptive Cross-Modal Semantic Refinement and Transfer for Open-Vocabulary Multi-Label Recognition [59.203152078315235]
We propose a novel category-adaptive cross-modal semantic refinement and transfer (C$2$SRT) framework to explore the semantic correlation.
The proposed framework consists of two complementary modules, i.e., intra-category semantic refinement (ISR) module and inter-category semantic transfer (IST) module.
Experiments on OV-MLR benchmarks clearly demonstrate that the proposed C$2$SRT framework outperforms current state-of-the-art algorithms.
arXiv Detail & Related papers (2024-12-09T04:00:18Z) - LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation [16.864086165056698]
Existing open-vocabulary approaches leverage vision-language models, such as CLIP, to align visual features with rich semantic features acquired through pre-training on large-scale vision-language datasets.
We propose to alleviate the issues by leveraging multiple large-scale models to enhance the alignment between fine-grained visual features and enriched linguistic features.
Our method achieves state-of-the-art performance across all major open-vocabulary segmentation benchmarks.
arXiv Detail & Related papers (2024-11-30T05:49:42Z) - Envisioning Class Entity Reasoning by Large Language Models for Few-shot Learning [13.68867780184022]
Few-shot learning aims to recognize new concepts using a limited number of visual samples.
Our framework incorporates both the abstract class semantics and the concrete class entities extracted from Large Language Models (LLMs)
For the challenging one-shot setting, our approach, utilizing the ResNet-12 backbone, achieves an average improvement of 1.95% over the second-best competitor.
arXiv Detail & Related papers (2024-08-22T15:10:20Z) - RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition [78.97487780589574]
Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories.
This paper introduces a Retrieving And Ranking augmented method for MLLMs.
Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base.
arXiv Detail & Related papers (2024-03-20T17:59:55Z) - Data-free Multi-label Image Recognition via LLM-powered Prompt Tuning [23.671999163027284]
This paper proposes a novel framework for multi-label image recognition without any training data.
It uses knowledge of pre-trained Large Language Model to learn prompts to adapt pretrained Vision-Language Model like CLIP to multilabel classification.
Our framework presents a new way to explore the synergies between multiple pre-trained models for novel category recognition.
arXiv Detail & Related papers (2024-03-02T13:43:32Z) - UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding [90.74967596080982]
This paper extends Contrastive Language-Image Pre-training (CLIP) with multi-granularity alignment.
We develop a Unified Multi-Granularity learning framework, termed UMG-CLIP, which simultaneously empowers the model with versatile perception abilities.
With parameter efficient tuning, UMG-CLIP surpasses current widely used CLIP variants and achieves state-of-the-art performance on diverse image understanding benchmarks.
arXiv Detail & Related papers (2024-01-12T06:35:09Z) - TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary
Multi-Label Classification of CLIP Without Training [29.431698321195814]
Contrastive Language-Image Pre-training (CLIP) has demonstrated impressive capabilities in open-vocabulary classification.
CLIP shows poor performance on multi-label datasets because the global feature tends to be dominated by the most prominent class.
We propose a local-to-global framework to obtain image tags.
arXiv Detail & Related papers (2023-12-20T08:15:40Z) - Towards Realistic Zero-Shot Classification via Self Structural Semantic
Alignment [53.2701026843921]
Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification.
In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary.
We propose the Self Structural Semantic Alignment (S3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning.
arXiv Detail & Related papers (2023-08-24T17:56:46Z) - Semantic Representation and Dependency Learning for Multi-Label Image
Recognition [76.52120002993728]
We propose a novel and effective semantic representation and dependency learning (SRDL) framework to learn category-specific semantic representation for each category.
Specifically, we design a category-specific attentional regions (CAR) module to generate channel/spatial-wise attention matrices to guide model.
We also design an object erasing (OE) module to implicitly learn semantic dependency among categories by erasing semantic-aware regions.
arXiv Detail & Related papers (2022-04-08T00:55:15Z) - Scaling up Multi-domain Semantic Segmentation with Sentence Embeddings [81.09026586111811]
We propose an approach to semantic segmentation that achieves state-of-the-art supervised performance when applied in a zero-shot setting.
This is achieved by replacing each class label with a vector-valued embedding of a short paragraph that describes the class.
The resulting merged semantic segmentation dataset of over 2 Million images enables training a model that achieves performance equal to that of state-of-the-art supervised methods on 7 benchmark datasets.
arXiv Detail & Related papers (2022-02-04T07:19:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.