Dual-View Alignment Learning with Hierarchical-Prompt for Class-Imbalance Multi-Label Classification
- URL: http://arxiv.org/abs/2509.17747v1
- Date: Mon, 22 Sep 2025 13:11:12 GMT
- Title: Dual-View Alignment Learning with Hierarchical-Prompt for Class-Imbalance Multi-Label Classification
- Authors: Sheng Huang, Jiexuan Yan, Beiyan Liu, Bo Liu, Richang Hong
- Abstract summary: Real-world datasets often exhibit class imbalance across multiple categories, manifesting as long-tailed distributions and few-shot scenarios. This is especially challenging in Class-Imbalanced Multi-Label Image Classification (CI-MLIC) tasks, where data imbalance and multi-object recognition present significant obstacles. We propose a novel method termed Dual-View Alignment Learning with Hierarchical Prompt (HP-DVAL) to mitigate the class-imbalance problem in multi-label settings.
- Score: 45.76234309840256
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Real-world datasets often exhibit class imbalance across multiple categories, manifesting as long-tailed distributions and few-shot scenarios. This is especially challenging in Class-Imbalanced Multi-Label Image Classification (CI-MLIC) tasks, where data imbalance and multi-object recognition present significant obstacles. To address these challenges, we propose a novel method termed Dual-View Alignment Learning with Hierarchical Prompt (HP-DVAL), which leverages multi-modal knowledge from vision-language pretrained (VLP) models to mitigate the class-imbalance problem in multi-label settings. Specifically, HP-DVAL employs dual-view alignment learning to transfer the powerful feature representation capabilities from VLP models by extracting complementary features for accurate image-text alignment. To better adapt VLP models for CI-MLIC tasks, we introduce a hierarchical prompt-tuning strategy that utilizes global and local prompts to learn task-specific and context-related prior knowledge. Additionally, we design a semantic consistency loss during prompt tuning to prevent learned prompts from deviating from general knowledge embedded in VLP models. The effectiveness of our approach is validated on two CI-MLIC benchmarks: MS-COCO and VOC2007. Extensive experimental results demonstrate the superiority of our method over SOTA approaches, achieving mAP improvements of 10.0% and 5.2% on the long-tailed multi-label image classification task, and 6.8% and 2.9% on the multi-label few-shot image classification task.
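Since the abstract only describes the method at a high level, the following minimal PyTorch sketch illustrates the two prompt-related ingredients it names: hierarchical (global plus local) prompt tokens, and a semantic consistency term anchoring learned prompts to frozen VLP text features. All names (`HierarchicalPromptHead`, `n_global`, `semantic_consistency_loss`, etc.) and the exact loss form are assumptions for illustration, not HP-DVAL's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalPromptHead(nn.Module):
    """Sketch of hierarchical prompt tuning: global (task-level) context
    tokens shared by all classes plus local (class-level) tokens, both
    prepended to the embedded class-name tokens before a frozen
    CLIP-style text encoder. Names and shapes are illustrative."""

    def __init__(self, text_encoder: nn.Module, num_classes: int,
                 embed_dim: int = 512, n_global: int = 4, n_local: int = 2):
        super().__init__()
        # Assumed frozen, and assumed to accept token embeddings
        # directly (as in CoOp-style prompt-tuning implementations).
        self.text_encoder = text_encoder
        for p in self.text_encoder.parameters():
            p.requires_grad_(False)
        self.global_prompts = nn.Parameter(0.02 * torch.randn(n_global, embed_dim))
        self.local_prompts = nn.Parameter(0.02 * torch.randn(num_classes, n_local, embed_dim))

    def forward(self, class_tokens: torch.Tensor) -> torch.Tensor:
        # class_tokens: (C, L, D) embedded class-name tokens
        C = class_tokens.size(0)
        g = self.global_prompts.unsqueeze(0).expand(C, -1, -1)
        seq = torch.cat([g, self.local_prompts, class_tokens], dim=1)
        return self.text_encoder(seq)  # assumed to return (C, D) text features


def semantic_consistency_loss(learned_feats: torch.Tensor,
                              handcrafted_feats: torch.Tensor) -> torch.Tensor:
    """One plausible consistency term: keep learned-prompt text features
    close (in cosine distance) to the frozen VLP features of hand-crafted
    prompts such as "a photo of a {class}"."""
    a = F.normalize(learned_feats, dim=-1)
    b = F.normalize(handcrafted_feats, dim=-1)
    return (1.0 - (a * b).sum(dim=-1)).mean()
```

In training, the total objective would plausibly combine a multi-label classification loss (e.g. binary cross-entropy or an imbalance-aware variant) with a weighted consistency term, e.g. `loss = bce + lambda_sc * semantic_consistency_loss(learned, frozen)`, where `lambda_sc` is a hypothetical weighting hyperparameter.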
Related papers
- Unleashing the Power of Vision-Language Models for Long-Tailed Multi-Label Visual Recognition [55.189113121465816]
We propose a novel correlation adaptation prompt network (CAPNET) for long-tailed multi-label visual recognition. CAPNET explicitly models label correlations using CLIP's textual encoder. It improves generalization through test-time ensembling and realigns visual-textual modalities.
arXiv Detail & Related papers (2025-11-25T18:57:28Z) - Vision Large Language Models Are Good Noise Handlers in Engagement Analysis [54.397912827957164]
We propose a framework leveraging Vision Large Language Models (VLMs) to refine annotations and guide the training process. Our framework uses a questionnaire to extract behavioral cues and split data into high- and low-reliability subsets. We demonstrate that classical computer vision models trained on refined high-reliability subsets and enhanced with our curriculum strategy show improvements.
arXiv Detail & Related papers (2025-11-18T18:50:26Z) - Improving Multi-label Recognition using Class Co-Occurrence Probabilities [7.062238472483738]
Multi-label Recognition (MLR) involves the identification of multiple objects within an image.
Recent works have leveraged information from vision-language models (VLMs) trained on large image-text datasets for the task.
We propose a framework to extend the independent classifiers by incorporating co-occurrence information for object pairs (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2024-04-24T20:33:25Z) - Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address this mismatch using prompt-based techniques, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z) - LAMM: Label Alignment for Multi-Modal Prompt Learning [17.478967970736115]
We introduce an innovative label alignment method named LAMM, which can adjust the category embeddings of downstream datasets through end-to-end training.
Our method significantly improves the performance of existing multi-modal prompt learning models in few-shot scenarios.
Our methodology also outperforms other prompt-tuning methods in continual learning.
arXiv Detail & Related papers (2023-12-13T15:29:52Z) - APoLLo: Unified Adapter and Prompt Learning for Vision Language Models [58.9772868980283]
We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models.
APoLLo achieves a relative gain of up to 6.03% over MaPLe (SOTA) on novel classes for 10 diverse image recognition datasets.
arXiv Detail & Related papers (2023-12-04T01:42:09Z) - Boosting Multi-Label Image Classification with Complementary Parallel Self-Distillation [15.518137695660668]
Multi-Label Image Classification approaches usually exploit label correlations to achieve good performance.
However, emphasizing correlations such as co-occurrence may overlook discriminative features of the target itself and lead to model overfitting.
In this study, we propose a generic framework named Parallel Self-Distillation (PSD) for boosting MLIC models.
arXiv Detail & Related papers (2022-05-23T01:28:38Z) - Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose an unbiased Dense Contrastive Visual-Linguistic Pretraining approach that replaces region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z) - Multitask Learning for Class-Imbalanced Discourse Classification [74.41900374452472]
We show that a multitask approach can improve the Micro F1-score by 7% over current state-of-the-art benchmarks.
We also offer a comparative review of additional techniques proposed to address resource-poor problems in NLP.
arXiv Detail & Related papers (2021-01-02T07:13:41Z)
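As a rough illustration of the co-occurrence idea referenced in the "Improving Multi-label Recognition using Class Co-Occurrence Probabilities" entry above, the sketch below folds pairwise conditional probabilities, estimated from training labels, into independent per-class scores. The function names and the blending rule (weight `alpha`) are hypothetical; the paper's actual formulation may differ.

```python
import numpy as np


def cooccurrence_matrix(labels: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Estimate cond[i, j] = P(class i present | class j present) from a
    binary label matrix of shape (num_samples, num_classes)."""
    joint = labels.T @ labels           # pairwise co-occurrence counts
    prior = labels.sum(axis=0) + eps    # per-class presence counts
    return joint / prior[None, :]       # column j conditions on class j


def refine_scores(probs: np.ndarray, cond: np.ndarray,
                  alpha: float = 0.3) -> np.ndarray:
    """Blend independent per-class probabilities (shape (N, C)) with
    evidence propagated from co-occurring classes: a confident class j
    raises the score of class i in proportion to P(i | j). The blend
    weight `alpha` is an illustrative hyperparameter."""
    # Weighted average of P(i | j) using the current beliefs as weights.
    propagated = (cond @ probs.T).T / probs.sum(axis=1, keepdims=True)
    return (1 - alpha) * probs + alpha * np.clip(propagated, 0.0, 1.0)
```

In such a setup, `cooccurrence_matrix` would be fit once on the training label matrix, and `refine_scores` applied to the sigmoid outputs of the independent classifiers at inference time.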