Open-Vocabulary Multi-Label Classification via Multi-modal Knowledge Transfer
- URL: http://arxiv.org/abs/2207.01887v1
- Date: Tue, 5 Jul 2022 08:32:18 GMT
- Title: Open-Vocabulary Multi-Label Classification via Multi-modal Knowledge Transfer
- Authors: Sunan He, Taian Guo, Tao Dai, Ruizhi Qiao, Bo Ren, Shu-Tao Xia
- Abstract summary: Multi-label zero-shot learning (ML-ZSL) focuses on transferring knowledge via a pre-trained textual label embedding.
We propose a novel open-vocabulary framework, named multi-modal knowledge transfer (MKT), for multi-label classification.
- Score: 55.885555581039895
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Real-world recognition systems often encounter plenty of unseen labels in
practice. To identify such unseen labels, multi-label zero-shot learning
(ML-ZSL) focuses on transferring knowledge by a pre-trained textual label
embedding (e.g., GloVe). However, such methods only exploit single-modal
knowledge from a language model, while ignoring the rich semantic information
inherent in image-text pairs. In contrast, recently developed open-vocabulary (OV)
based methods succeed in exploiting such information from image-text pairs in
object detection, and achieve impressive performance. Inspired by the success
of OV-based methods, we propose a novel open-vocabulary framework, named
multi-modal knowledge transfer (MKT), for multi-label classification.
Specifically, our method exploits multi-modal knowledge of image-text pairs
based on a vision and language pretraining (VLP) model. To facilitate
transferring the image-text matching ability of the VLP model, knowledge
distillation is used to guarantee the consistency of image and label
embeddings, along with prompt tuning to further update the label embeddings. To
further recognize multiple objects, a simple but effective two-stream module is
developed to capture both local and global features. Extensive experimental
results show that our method significantly outperforms state-of-the-art methods
on public benchmark datasets. Code will be available at
https://github.com/seanhe97/MKT.
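
The abstract names three ingredients: knowledge distillation that keeps the image embedding consistent with the frozen VLP model, prompt tuning of the label embeddings, and a two-stream module that mixes local and global features. The snippet below is a minimal sketch of the two-stream scoring and the embedding-consistency term, assuming a CLIP-like backbone; the class and function names, the L1 consistency loss, and the mean fusion of the two streams are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoStreamHead(nn.Module):
    """Sketch of a two-stream scoring head: a global image embedding and local
    patch features are each matched against the label embeddings, then fused."""

    def __init__(self, embed_dim: int):
        super().__init__()
        self.local_proj = nn.Linear(embed_dim, embed_dim)  # hypothetical local projection

    def forward(self, global_feat, local_feats, label_embeds):
        # global_feat:  (B, D)    pooled image embedding from the VLP image encoder
        # local_feats:  (B, N, D) patch-level features from the same encoder
        # label_embeds: (C, D)    label embeddings, initialized from the VLP text
        #                         encoder and refined by prompt tuning
        g = F.normalize(global_feat, dim=-1)
        loc = F.normalize(self.local_proj(local_feats), dim=-1)
        e = F.normalize(label_embeds, dim=-1)

        global_scores = g @ e.t()                                            # (B, C)
        local_scores = torch.einsum("bnd,cd->bnc", loc, e).max(dim=1).values  # (B, C)
        return (global_scores + local_scores) / 2


def embedding_consistency_loss(student_img_embed, teacher_img_embed):
    """Distillation term keeping the trainable image embedding close to the frozen
    VLP teacher embedding, so the pre-trained image-text matching ability is kept."""
    return F.l1_loss(F.normalize(student_img_embed, dim=-1),
                     F.normalize(teacher_img_embed, dim=-1))
```

In this reading, the label embeddings would be produced by the VLP text encoder from learnable prompts, so the multi-label loss can update the prompts while the consistency term anchors the image side to the teacher.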
Related papers
- Text-Region Matching for Multi-Label Image Recognition with Missing Labels [5.095488730708477]
TRM-ML is a novel method for enhancing meaningful cross-modal matching.
We propose a category prototype that leverages intra- and inter-category semantic relationships to estimate unknown labels.
Our proposed framework outperforms the state-of-the-art methods by a significant margin.
arXiv Detail & Related papers (2024-07-26T05:29:24Z)
- Data-free Multi-label Image Recognition via LLM-powered Prompt Tuning [23.671999163027284]
This paper proposes a novel framework for multi-label image recognition without any training data.
It uses knowledge from a pre-trained Large Language Model to learn prompts that adapt a pre-trained Vision-Language Model such as CLIP to multi-label classification.
Our framework presents a new way to explore the synergies between multiple pre-trained models for novel category recognition.
arXiv Detail & Related papers (2024-03-02T13:43:32Z)
- Query-Based Knowledge Sharing for Open-Vocabulary Multi-Label Classification [5.985859108787149]
Multi-label zero-shot learning is a non-trivial task in computer vision.
We propose a novel query-based knowledge sharing paradigm for this task.
Our framework significantly outperforms state-of-the-art methods on the zero-shot task by 5.9% and 4.5% mAP on NUS-WIDE and Open Images, respectively.
arXiv Detail & Related papers (2024-01-02T12:18:40Z)
- Multi-Label Knowledge Distillation [86.03990467785312]
We propose a novel multi-label knowledge distillation method.
On one hand, it exploits the informative semantic knowledge in the logits by dividing the multi-label learning problem into a set of binary classification problems (a minimal sketch of this decomposition appears after this list).
On the other hand, it enhances the distinctiveness of the learned feature representations by leveraging the structural information of label-wise embeddings.
arXiv Detail & Related papers (2023-08-12T03:19:08Z)
- Multi-Label Self-Supervised Learning with Scene Images [21.549234013998255]
This paper shows that quality image representations can be learned by treating scene/multi-label image SSL simply as a multi-label classification problem.
The proposed method is named Multi-Label Self-supervised learning (MLS).
arXiv Detail & Related papers (2023-08-07T04:04:22Z)
- Texts as Images in Prompt Tuning for Multi-Label Image Recognition [70.9310322461598]
We advocate that image-text contrastive learning makes it feasible to treat texts as images for prompt tuning and introduce TaI prompting.
Particularly, we apply TaI prompting to multi-label image recognition, where sentences in the wild serve as alternatives to images for prompt tuning.
Our proposed TaI-DPT outperforms zero-shot CLIP by a large margin on multiple benchmarks.
arXiv Detail & Related papers (2022-11-23T07:00:11Z)
- Multi-Modal Few-Shot Object Detection with Meta-Learning-Based Cross-Modal Prompting [77.69172089359606]
We study multi-modal few-shot object detection (FSOD) in this paper, using both few-shot visual examples and class semantic information for detection.
Our approach is motivated by the high-level conceptual similarity of (metric-based) meta-learning and prompt-based learning.
We comprehensively evaluate the proposed multi-modal FSOD models on multiple few-shot object detection benchmarks, achieving promising results.
arXiv Detail & Related papers (2022-04-16T16:45:06Z)
- Unified Contrastive Learning in Image-Text-Label Space [130.31947133453406]
Unified Contrastive Learning (UniCL) is an effective way of learning semantically rich yet discriminative representations.
On its own, UniCL is a good learner on pure image-label data, rivaling supervised learning methods across three image classification datasets.
arXiv Detail & Related papers (2022-04-07T17:34:51Z)
- Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to align image and text representations before fusing (ALBEF) them through cross-modal attention.
We propose momentum distillation, a self-training method that learns from pseudo-targets produced by a momentum model; a minimal sketch appears after this list.
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-07-16T00:19:22Z)
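
For the Multi-Label Knowledge Distillation entry above, the logit-level part of the description (dividing the multi-label problem into per-label binary problems) can be sketched roughly as follows. The temperature, the binary cross-entropy form, and the function name are assumptions made for illustration, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F


def one_vs_all_logit_distillation(student_logits, teacher_logits, temperature=2.0):
    """Decompose a C-way multi-label problem into C binary problems and distill
    the teacher's per-label probabilities into the student.
    student_logits, teacher_logits: (B, C) raw multi-label logits."""
    with torch.no_grad():
        soft_targets = torch.sigmoid(teacher_logits / temperature)  # per-label soft targets
    student_probs = torch.sigmoid(student_logits / temperature)
    # binary cross-entropy per label, averaged over labels and batch;
    # the T^2 factor keeps gradient magnitudes comparable across temperatures
    return F.binary_cross_entropy(student_probs, soft_targets) * temperature ** 2
```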
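
For the Align before Fuse (ALBEF) entry, the two mechanisms mentioned, an image-text contrastive loss computed before fusion and momentum distillation from pseudo-targets of a momentum model, can be sketched as below for the image-to-text direction only. The EMA coefficient, temperature, and mixing weight are common-practice assumptions, not values taken from the paper.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def ema_update(momentum_model, model, m=0.995):
    """Keep the momentum (teacher) encoders as an exponential moving average
    of the online encoders' parameters."""
    for p_m, p in zip(momentum_model.parameters(), model.parameters()):
        p_m.data.mul_(m).add_(p.data, alpha=1.0 - m)


def momentum_distilled_itc_loss(img_feat, txt_feat, img_feat_m, txt_feat_m,
                                temperature=0.07, alpha=0.4):
    """Image-to-text contrastive loss whose one-hot targets are mixed with
    soft pseudo-targets produced by the momentum encoders.
    *_feat:   (B, D) L2-normalized features from the online encoders
    *_feat_m: (B, D) L2-normalized features from the momentum encoders"""
    logits = img_feat @ txt_feat.t() / temperature
    with torch.no_grad():
        soft_targets = F.softmax(img_feat_m @ txt_feat_m.t() / temperature, dim=1)
        hard_targets = torch.eye(img_feat.size(0), device=img_feat.device)
        targets = alpha * soft_targets + (1 - alpha) * hard_targets
    return -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```

A symmetric text-to-image term would normally be added, and ema_update would be called after each optimizer step.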