Category-Adaptive Cross-Modal Semantic Refinement and Transfer for Open-Vocabulary Multi-Label Recognition
- URL: http://arxiv.org/abs/2412.06190v1
- Date: Mon, 09 Dec 2024 04:00:18 GMT
- Title: Category-Adaptive Cross-Modal Semantic Refinement and Transfer for Open-Vocabulary Multi-Label Recognition
- Authors: Haijing Liu, Tao Pu, Hefeng Wu, Keze Wang, Liang Lin
- Abstract summary: We propose a novel category-adaptive cross-modal semantic refinement and transfer (C$^2$SRT) framework to explore semantic correlations.
The proposed framework consists of two complementary modules, i.e., intra-category semantic refinement (ISR) module and inter-category semantic transfer (IST) module.
Experiments on OV-MLR benchmarks demonstrate that the proposed C$^2$SRT framework outperforms current state-of-the-art algorithms.
- Score: 59.203152078315235
- Abstract: Benefiting from the generalization capability of CLIP, recent vision-language pre-training (VLP) models have demonstrated an impressive ability to capture virtually any visual concept in daily images. However, due to the presence of unseen categories in open-vocabulary settings, existing algorithms struggle to capture strong semantic correlations between categories, resulting in sub-optimal performance on open-vocabulary multi-label recognition (OV-MLR). Furthermore, the number of discriminative regions varies substantially across object categories, which is misaligned with the fixed-number patch matching used in current methods and introduces noisy visual cues that hinder accurate capture of the target semantics. To tackle these challenges, we propose a novel category-adaptive cross-modal semantic refinement and transfer (C$^2$SRT) framework that explores semantic correlations both within each category and across different categories in a category-adaptive manner. The proposed framework consists of two complementary modules: an intra-category semantic refinement (ISR) module and an inter-category semantic transfer (IST) module. Specifically, the ISR module leverages the cross-modal knowledge of the VLP model to adaptively find a set of local discriminative regions that best represent the semantics of the target category. The IST module adaptively discovers the set of categories most correlated with a target category, using the commonsense capabilities of large language models (LLMs) to construct a category-adaptive correlation graph, and transfers semantic knowledge from the correlated seen categories to unseen ones. Extensive experiments on OV-MLR benchmarks demonstrate that the proposed C$^2$SRT framework outperforms current state-of-the-art algorithms.
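The abstract does not give the ISR and IST formulations, so the following is only a hypothetical sketch of the two ideas it describes, not the authors' method. It assumes (1) that category-adaptive region selection can be approximated by keeping the smallest set of patches whose softmax-normalized patch-text similarity reaches a fixed mass, in contrast to a fixed top-k, and (2) that seen-to-unseen transfer can be approximated by blending an unseen category's score with a correlation-weighted average of the scores of seen categories whose correlation exceeds a threshold. The function names, the hand-written correlation table (standing in for LLM-derived correlations), and the parameters `mass`, `temperature`, `threshold`, and `alpha` are all illustrative assumptions.

```python
import math
from typing import Dict, List, Sequence, Set


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def adaptive_regions(patches: List[Sequence[float]], text: Sequence[float],
                     mass: float = 0.8, temperature: float = 0.1) -> List[int]:
    """Hypothetical ISR-style selection: smallest patch set reaching `mass`.

    Assumption: softmax over patch-text similarities, then greedy accumulation.
    Unlike a fixed top-k, the number of selected patches varies per category.
    """
    sims = [cosine(p, text) for p in patches]
    peak = max(sims)  # subtract the max for numerical stability
    weights = [math.exp((s - peak) / temperature) for s in sims]
    total = sum(weights)
    probs = [w / total for w in weights]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, acc = [], 0.0
    for i in order:
        chosen.append(i)
        acc += probs[i]
        if acc >= mass:
            break
    return sorted(chosen)


def category_score(patches: List[Sequence[float]], text: Sequence[float],
                   **kw: float) -> float:
    """Mean patch-text similarity over the adaptively selected regions."""
    idx = adaptive_regions(patches, text, **kw)
    return sum(cosine(patches[i], text) for i in idx) / len(idx)


def transfer_scores(scores: Dict[str, float], seen: Set[str],
                    corr: Dict[str, Dict[str, float]],
                    threshold: float = 0.5, alpha: float = 0.5) -> Dict[str, float]:
    """Hypothetical IST-style transfer from correlated seen categories to unseen ones.

    `corr[u][s]` is an assumed correlation in [0, 1] (e.g. elicited from an LLM).
    Only seen neighbours at or above `threshold` are used, so the neighbour
    count adapts per category; seen-category scores are left unchanged.
    """
    refined = dict(scores)
    for cat in scores:
        if cat in seen:
            continue
        neigh = {s: w for s, w in corr.get(cat, {}).items()
                 if s in seen and s in scores and w >= threshold}
        if not neigh:
            continue
        z = sum(neigh.values())
        transferred = sum(w * scores[s] for s, w in neigh.items()) / z
        refined[cat] = (1 - alpha) * scores[cat] + alpha * transferred
    return refined


if __name__ == "__main__":
    text = [1.0, 0.0]
    peaked = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
    flat = [[1.0, 0.0]] * 4
    print(adaptive_regions(peaked, text))  # [0]
    print(adaptive_regions(flat, text))    # [0, 1, 2, 3]
    scores = {"dog": 0.9, "car": 0.1, "leash": 0.2}
    corr = {"leash": {"dog": 0.9, "car": 0.1}}
    refined = transfer_scores(scores, {"dog", "car"}, corr)
    print(round(refined["leash"], 3))      # 0.55
```

A sharply peaked similarity map selects a single patch while a flat one selects all four, which is the kind of per-category variation that fixed-number patch matching cannot express. In the transfer step, only "dog" passes the threshold for the unseen "leash", so the number of neighbours is likewise decided per category rather than fixed.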
Related papers
- Learning Semantic-Aware Representation in Visual-Language Models for Multi-Label Recognition with Partial Labels [19.740929527669483]
Multi-label recognition with partial labels (MLR-PL) is a practical task in computer vision.
We introduce a semantic decoupling module and a category-specific prompt optimization method in a CLIP-based framework.
Our method effectively separates information from different categories and achieves better performance than the CLIP-based baseline method.
Date: 2024-12-14T14:31:36Z
- Category-Prompt Refined Feature Learning for Long-Tailed Multi-Label Image Classification [8.139529179222844]
Category-Prompt Refined Feature Learning (CPRFL) is a novel approach for Long-Tailed Multi-Label image Classification.
CPRFL initializes category-prompts from the pretrained CLIP's embeddings and decouples category-specific visual representations.
We validate the effectiveness of our method on two LTMLC benchmarks and extensive experiments demonstrate the superiority of our work over baselines.
Date: 2024-08-15T12:51:57Z
- Dual-Modal Prompting for Sketch-Based Image Retrieval [76.12076969949062]
We propose a dual-modal CLIP (DP-CLIP) network, in which an adaptive prompting strategy is designed.
We employ a set of images within the target category and the textual category label to respectively construct a set of category-adaptive prompt tokens and channel scales.
Our DP-CLIP outperforms the state-of-the-art fine-grained zero-shot method by 7.3% in Acc.@1 on the Sketchy dataset.
Date: 2024-04-29T13:43:49Z
- Balanced Classification: A Unified Framework for Long-Tailed Object Detection [74.94216414011326]
Conventional detectors suffer from performance degradation when dealing with long-tailed data due to a classification bias towards the majority head categories.
We introduce a unified framework called BAlanced CLassification (BACL), which enables adaptive rectification of inequalities caused by disparities in category distribution.
BACL consistently achieves performance improvements across various datasets with different backbones and architectures.
Date: 2023-08-04T09:11:07Z
- Cluster-to-adapt: Few Shot Domain Adaptation for Semantic Segmentation across Disjoint Labels [80.05697343811893]
Cluster-to-Adapt (C2A) is a computationally efficient clustering-based approach for domain adaptation across segmentation datasets.
We show that such a clustering objective enforced in a transformed feature space serves to automatically select categories across source and target domains.
Date: 2022-08-04T17:57:52Z
- Semantic Representation and Dependency Learning for Multi-Label Image Recognition [76.52120002993728]
We propose a novel and effective semantic representation and dependency learning (SRDL) framework to learn category-specific semantic representation for each category.
Specifically, we design a category-specific attentional regions (CAR) module to generate channel/spatial-wise attention matrices to guide the model.
We also design an object erasing (OE) module to implicitly learn semantic dependency among categories by erasing semantic-aware regions.
Date: 2022-04-08T00:55:15Z
- Towards Novel Target Discovery Through Open-Set Domain Adaptation [73.81537683043206]
Open-set domain adaptation (OSDA) assumes that the target domain contains samples from novel categories that are not observed in the external source domain.
We propose a novel framework to accurately identify the seen categories in target domain, and effectively recover the semantic attributes for unseen categories.
Date: 2021-05-06T04:22:29Z
- Unsupervised Domain Adaptation in Semantic Segmentation via Orthogonal and Clustered Embeddings [25.137859989323537]
We propose an effective Unsupervised Domain Adaptation (UDA) strategy, based on a feature clustering method.
We introduce two novel learning objectives to enhance the discriminative clustering performance.
Date: 2020-11-25T10:06:22Z