Open Vocabulary Multi-Label Classification with Dual-Modal Decoder on
Aligned Visual-Textual Features
- URL: http://arxiv.org/abs/2208.09562v2
- Date: Sat, 7 Oct 2023 22:24:42 GMT
- Title: Open Vocabulary Multi-Label Classification with Dual-Modal Decoder on
Aligned Visual-Textual Features
- Authors: Shichao Xu, Yikang Li, Jenhao Hsiao, Chiuman Ho, Zhu Qi
- Abstract summary: We propose a novel algorithm, Aligned Dual moDality ClaSsifier (ADDS), which includes a Dual-Modal decoder (DM-decoder) with alignment between visual and textual features.
Extensive experiments conducted on several standard benchmarks, NUS-WIDE, ImageNet-1k, ImageNet-21k, and MS-COCO, demonstrate that our approach significantly outperforms previous methods.
- Score: 14.334304670606633
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In computer vision, multi-label recognition is an important task with many
real-world applications, but classifying previously unseen labels remains a
significant challenge. In this paper, we propose a novel algorithm, Aligned
Dual moDality ClaSsifier (ADDS), which includes a Dual-Modal decoder
(DM-decoder) with alignment between visual and textual features, for
open-vocabulary multi-label classification tasks. We then design a simple yet
effective method called Pyramid-Forwarding to improve performance on
high-resolution inputs. Moreover, Selective Language Supervision is
applied to further enhance the model performance. Extensive experiments
conducted on several standard benchmarks, NUS-WIDE, ImageNet-1k, ImageNet-21k,
and MS-COCO, demonstrate that our approach significantly outperforms previous
methods and provides state-of-the-art performance for open-vocabulary
multi-label classification, conventional multi-label classification and an
extreme case called single-to-multi label classification where models trained
on single-label datasets (ImageNet-1k, ImageNet-21k) are tested on multi-label
ones (MS-COCO and NUS-WIDE).
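The abstract gives no implementation details, but the core idea it describes -- scoring every candidate label by comparing its text embedding against the image's visual embedding, so that an open vocabulary just means more text rows -- can be illustrated with a minimal numpy sketch. The encoders, the threshold, and the temperature below are placeholder assumptions, not the authors' DM-decoder:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Unit-normalize so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def open_vocab_multilabel(image_emb, label_embs, threshold=0.5, temperature=0.07):
    """Score every candidate label against one image embedding.

    image_emb:  (d,) visual feature (e.g. from a frozen image encoder).
    label_embs: (n, d) text features, one row per label name; the label
                set is open -- adding a new label just adds a new row.
    Returns a boolean mask of predicted labels and the per-label scores.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(label_embs)
    sims = txt @ img                                    # cosine similarity per label
    scores = 1.0 / (1.0 + np.exp(-sims / temperature))  # independent per-label sigmoid
    return scores > threshold, scores

# Toy example with random vectors standing in for real encoder outputs.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=64)
label_embs = rng.normal(size=(5, 64))
predicted, scores = open_vocab_multilabel(image_emb, label_embs)
print(predicted.shape, scores.shape)  # (5,) (5,)
```

Per-label sigmoids (rather than a softmax over labels) are what make this multi-label: each label is accepted or rejected independently, so several labels can fire on one image.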
Related papers
- Text-Region Matching for Multi-Label Image Recognition with Missing Labels [5.095488730708477]
TRM-ML is a novel method for enhancing meaningful cross-modal matching.
We propose a category prototype that leverages intra- and inter-category semantic relationships to estimate unknown labels.
Our proposed framework outperforms the state-of-the-art methods by a significant margin.
arXiv Detail & Related papers (2024-07-26T05:29:24Z)
- UniDEC : Unified Dual Encoder and Classifier Training for Extreme Multi-Label Classification [42.36546066941635]
Extreme Multi-label Classification (XMC) involves predicting a subset of relevant labels from an extremely large label space.
This work proposes UniDEC, a novel end-to-end trainable framework which trains the dual encoder and classifier together.
arXiv Detail & Related papers (2024-05-04T17:27:51Z)
- Textual Knowledge Matters: Cross-Modality Co-Teaching for Generalized Visual Class Discovery [69.91441987063307]
Generalized Category Discovery (GCD) aims to cluster unlabeled data from both known and unknown categories.
Current GCD methods rely on only visual cues, which neglect the multi-modality perceptive nature of human cognitive processes in discovering novel visual categories.
We propose a two-phase TextGCD framework to accomplish multi-modality GCD by exploiting powerful Visual-Language Models.
arXiv Detail & Related papers (2024-03-12T07:06:50Z)
- DualCoOp++: Fast and Effective Adaptation to Multi-Label Recognition with Limited Annotations [79.433122872973]
Multi-label image recognition in the low-label regime is a task of great challenge and practical significance.
We leverage the powerful alignment between textual and visual features pretrained with millions of auxiliary image-text pairs.
We introduce an efficient and effective framework called Evidence-guided Dual Context Optimization (DualCoOp++).
arXiv Detail & Related papers (2023-08-03T17:33:20Z)
- DICNet: Deep Instance-Level Contrastive Network for Double Incomplete Multi-View Multi-Label Classification [20.892833511657166]
Multi-view multi-label data in the real world is commonly incomplete due to the uncertain factors of data collection and manual annotation.
We propose a deep instance-level contrastive network, namely DICNet, to deal with the double incomplete multi-view multi-label classification problem.
Our DICNet is adept in capturing consistent discriminative representations of multi-view multi-label data and avoiding the negative effects of missing views and missing labels.
arXiv Detail & Related papers (2023-03-15T04:24:01Z)
- Texts as Images in Prompt Tuning for Multi-Label Image Recognition [70.9310322461598]
We advocate that image-text contrastive learning makes it feasible to treat texts as images for prompt tuning and introduce TaI prompting.
Particularly, we apply TaI prompting to multi-label image recognition, where sentences in the wild serve as alternatives to images for prompt tuning.
Our proposed TaI-DPT outperforms zero-shot CLIP by a large margin on multiple benchmarks.
arXiv Detail & Related papers (2022-11-23T07:00:11Z)
- DualCoOp: Fast Adaptation to Multi-Label Recognition with Limited Annotations [61.41339201200135]
We propose Dual Context Optimization (DualCoOp) as a unified framework for partial-label MLR and zero-shot MLR.
Since DualCoOp only introduces a very light learnable overhead upon the pretrained vision-language framework, it can quickly adapt to multi-label recognition tasks.
arXiv Detail & Related papers (2022-06-20T02:36:54Z)
- Dual-Perspective Semantic-Aware Representation Blending for Multi-Label Image Recognition with Partial Labels [70.36722026729859]
We propose a dual-perspective semantic-aware representation blending (DSRB) that blends multi-granularity category-specific semantic representation across different images.
The proposed DSRB consistently outperforms current state-of-the-art algorithms on all proportions of label settings.
arXiv Detail & Related papers (2022-05-26T00:33:44Z)
- Multi-Label Image Classification with Contrastive Learning [57.47567461616912]
We show that a direct application of contrastive learning can hardly improve performance in multi-label cases.
We propose a novel framework for multi-label classification with contrastive learning in a fully supervised setting.
arXiv Detail & Related papers (2021-07-24T15:00:47Z)
- Learning Discriminative Representations for Multi-Label Image Recognition [13.13795708478267]
We propose a unified deep network to learn discriminative features for the multi-label task.
By regularizing the whole network with the proposed loss, the performance of the well-known ResNet-101 is improved significantly.
arXiv Detail & Related papers (2021-07-23T12:10:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.