Related papers: Category-Prompt Refined Feature Learning for Long-Tailed Multi-Label Image Classification

Category-Prompt Refined Feature Learning for Long-Tailed Multi-Label Image Classification

URL: http://arxiv.org/abs/2408.08125v1
Date: Thu, 15 Aug 2024 12:51:57 GMT
Title: Category-Prompt Refined Feature Learning for Long-Tailed Multi-Label Image Classification
Authors: Jiexuan Yan, Sheng Huang, Nankun Mu, Luwen Huangfu, Bo Liu,
Abstract summary: Category-Prompt Refined Feature Learning (CPRFL) is a novel approach for Long-Tailed Multi-Label image Classification. CPRFL initializes category-prompts from the pretrained CLIP's embeddings and decouples category-specific visual representations. We validate the effectiveness of our method on two LTMLC benchmarks and extensive experiments demonstrate the superiority of our work over baselines.
Score: 8.139529179222844
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Real-world data consistently exhibits a long-tailed distribution, often spanning multiple categories. This complexity underscores the challenge of content comprehension, particularly in scenarios requiring Long-Tailed Multi-Label image Classification (LTMLC). In such contexts, imbalanced data distribution and multi-object recognition pose significant hurdles. To address this issue, we propose a novel and effective approach for LTMLC, termed Category-Prompt Refined Feature Learning (CPRFL), utilizing semantic correlations between different categories and decoupling category-specific visual representations for each category. Specifically, CPRFL initializes category-prompts from the pretrained CLIP's embeddings and decouples category-specific visual representations through interaction with visual features, thereby facilitating the establishment of semantic correlations between the head and tail classes. To mitigate the visual-semantic domain bias, we design a progressive Dual-Path Back-Propagation mechanism to refine the prompts by progressively incorporating context-related visual information into prompts. Simultaneously, the refinement process facilitates the progressive purification of the category-specific visual representations under the guidance of the refined prompts. Furthermore, taking into account the negative-positive sample imbalance, we adopt the Asymmetric Loss as our optimization objective to suppress negative samples across all classes and potentially enhance the head-to-tail recognition performance. We validate the effectiveness of our method on two LTMLC benchmarks and extensive experiments demonstrate the superiority of our work over baselines. The code is available at https://github.com/jiexuanyan/CPRFL.

Related papers

Training-Free Class Purification for Open-Vocabulary Semantic Segmentation [72.87707878910896]
FreeCP is a training-free class purification framework for semantic segmentation.<n>We conduct experiments across eight benchmarks to validate FreeCP's effectiveness.<n>Results demonstrate that FreeCP, as a plug-and-play module, significantly boosts segmentation performance when combined with other OVSS methods.
arXiv Detail & Related papers (2025-08-01T11:55:12Z)
SDVPT: Semantic-Driven Visual Prompt Tuning for Open-World Object Counting [70.49268117587562]
We propose a plug-and-play Semantic-Driven Visual Prompt Tuning framework (SDVPT) that transfers knowledge from the training set to unseen categories. During inference, we dynamically synthesize the visual prompts for unseen categories based on the semantic correlation between unseen and training categories.
arXiv Detail & Related papers (2025-04-24T09:31:08Z)
Semantic-guided Representation Learning for Multi-Label Recognition [13.046479112800608]
Multi-label Recognition (MLR) involves assigning multiple labels to each data instance in an image. Recent Vision and Language Pre-training methods have made significant progress in tackling zero-shot MLR tasks. We introduce a Semantic-guided Representation Learning approach (SigRL) that enables the model to learn effective visual and textual representations.
arXiv Detail & Related papers (2025-04-04T08:15:08Z)
Learning Semantic-Aware Representation in Visual-Language Models for Multi-Label Recognition with Partial Labels [19.740929527669483]
Multi-label recognition with partial labels (MLR-PL) is a practical task in computer vision. We introduce a semantic decoupling module and a category-specific prompt optimization method in CLIP-based framework. Our method effectively separates information from different categories and achieves better performance compared to CLIP-based baseline method.
arXiv Detail & Related papers (2024-12-14T14:31:36Z)
Category-Adaptive Cross-Modal Semantic Refinement and Transfer for Open-Vocabulary Multi-Label Recognition [59.203152078315235]
We propose a novel category-adaptive cross-modal semantic refinement and transfer (C$2$SRT) framework to explore the semantic correlation. The proposed framework consists of two complementary modules, i.e., intra-category semantic refinement (ISR) module and inter-category semantic transfer (IST) module. Experiments on OV-MLR benchmarks clearly demonstrate that the proposed C$2$SRT framework outperforms current state-of-the-art algorithms.
arXiv Detail & Related papers (2024-12-09T04:00:18Z)
Envisioning Class Entity Reasoning by Large Language Models for Few-shot Learning [13.68867780184022]
Few-shot learning aims to recognize new concepts using a limited number of visual samples. Our framework incorporates both the abstract class semantics and the concrete class entities extracted from Large Language Models (LLMs) For the challenging one-shot setting, our approach, utilizing the ResNet-12 backbone, achieves an average improvement of 1.95% over the second-best competitor.
arXiv Detail & Related papers (2024-08-22T15:10:20Z)
Dual-Modal Prompting for Sketch-Based Image Retrieval [76.12076969949062]
We propose a dual-modal CLIP (DP-CLIP) network, in which an adaptive prompting strategy is designed. We employ a set of images within the target category and the textual category label to respectively construct a set of category-adaptive prompt tokens and channel scales. Our DP-CLIP outperforms the state-of-the-art fine-grained zero-shot method by 7.3% in Acc.@1 on the Sketchy dataset.
arXiv Detail & Related papers (2024-04-29T13:43:49Z)
Generalizable Whole Slide Image Classification with Fine-Grained Visual-Semantic Interaction [17.989559761931435]
We propose a novel "Fine-grained Visual-Semantic Interaction" framework for WSI classification. It is designed to enhance the model's generalizability by leveraging the interaction between localized visual patterns and fine-grained pathological semantics. Our method demonstrates robust generalizability and strong transferability, dominantly outperforming the counterparts on the TCGA Lung Cancer dataset.
arXiv Detail & Related papers (2024-02-29T16:29:53Z)
Balanced Classification: A Unified Framework for Long-Tailed Object Detection [74.94216414011326]
Conventional detectors suffer from performance degradation when dealing with long-tailed data due to a classification bias towards the majority head categories. We introduce a unified framework called BAlanced CLassification (BACL), which enables adaptive rectification of inequalities caused by disparities in category distribution. BACL consistently achieves performance improvements across various datasets with different backbones and architectures.
arXiv Detail & Related papers (2023-08-04T09:11:07Z)
LMPT: Prompt Tuning with Class-Specific Embedding Loss for Long-tailed Multi-Label Visual Recognition [12.62835357920401]
We propose a unified framework for LTML, namely prompt tuning with class-specific embedding loss (LMPT) Our method significantly surpasses the previous state-of-the-art methods and zero-shot CLIP in LTML.
arXiv Detail & Related papers (2023-05-08T08:14:46Z)
Not All Instances Contribute Equally: Instance-adaptive Class Representation Learning for Few-Shot Visual Recognition [94.04041301504567]
Few-shot visual recognition refers to recognize novel visual concepts from a few labeled instances. We propose a novel metric-based meta-learning framework termed instance-adaptive class representation learning network (ICRL-Net) for few-shot visual recognition.
arXiv Detail & Related papers (2022-09-07T10:00:18Z)
Semantic Representation and Dependency Learning for Multi-Label Image Recognition [76.52120002993728]
We propose a novel and effective semantic representation and dependency learning (SRDL) framework to learn category-specific semantic representation for each category. Specifically, we design a category-specific attentional regions (CAR) module to generate channel/spatial-wise attention matrices to guide model. We also design an object erasing (OE) module to implicitly learn semantic dependency among categories by erasing semantic-aware regions.
arXiv Detail & Related papers (2022-04-08T00:55:15Z)
CAD: Co-Adapting Discriminative Features for Improved Few-Shot Classification [11.894289991529496]
Few-shot classification is a challenging problem that aims to learn a model that can adapt to unseen classes given a few labeled samples. Recent approaches pre-train a feature extractor, and then fine-tune for episodic meta-learning. We propose a strategy to cross-attend and re-weight discriminative features for few-shot classification.
arXiv Detail & Related papers (2022-03-25T06:14:51Z)
Learning Debiased and Disentangled Representations for Semantic Segmentation [52.35766945827972]
We propose a model-agnostic and training scheme for semantic segmentation. By randomly eliminating certain class information in each training iteration, we effectively reduce feature dependencies among classes. Models trained with our approach demonstrate strong results on multiple semantic segmentation benchmarks.
arXiv Detail & Related papers (2021-10-31T16:15:09Z)
Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text. These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining. We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.