TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary
Multi-Label Classification of CLIP Without Training
- URL: http://arxiv.org/abs/2312.12828v1
- Date: Wed, 20 Dec 2023 08:15:40 GMT
- Title: TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary
Multi-Label Classification of CLIP Without Training
- Authors: Yuqi Lin, Minghao Chen, Kaipeng Zhang, Hengjia Li, Mingming Li, Zheng
Yang, Dongqin Lv, Binbin Lin, Haifeng Liu, Deng Cai
- Abstract summary: Contrastive Language-Image Pre-training (CLIP) has demonstrated impressive capabilities in open-vocabulary classification.
CLIP shows poor performance on multi-label datasets because the global feature tends to be dominated by the most prominent class.
We propose a local-to-global framework to obtain image tags.
- Score: 29.431698321195814
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive Language-Image Pre-training (CLIP) has demonstrated
impressive capabilities in open-vocabulary classification. The class token in
the image encoder is trained, under contrastive supervision, to capture global
features that distinguish different text descriptions, making it highly
effective for single-label classification. However, CLIP performs poorly on
multi-label datasets because the global feature tends to be dominated by the
most prominent class, and the contrastive nature of the softmax operation
aggravates this. In this
study, we observe that multi-label classification results rely heavily on
discriminative local features, which CLIP overlooks. We therefore dissect how
patch-wise spatial information is preserved in CLIP and propose
a local-to-global framework to obtain image tags. It comprises three steps: (1)
patch-level classification to obtain coarse scores; (2) a dual-masking
attention refinement (DMAR) module to refine the coarse scores; and (3) a
class-wise reidentification (CWR) module to remedy predictions from a global
perspective.
This framework is solely based on frozen CLIP and significantly enhances its
multi-label classification performance on various benchmarks without
dataset-specific training. Moreover, to comprehensively assess the quality and
practicality of the generated tags, we extend their application to a downstream
task, weakly supervised semantic segmentation (WSSS), using the generated tags
as image-level pseudo labels. Experiments demonstrate that this
classify-then-segment paradigm dramatically outperforms other annotation-free
segmentation methods and validates the effectiveness of the generated tags. Our
code is available at https://github.com/linyq2117/TagCLIP.
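To make the dominance problem concrete, here is a toy illustration (the logit
values are invented, not taken from the paper) of how a softmax over
image-text similarity logits suppresses a genuinely present secondary class,
whereas an independent per-class squashing keeps it detectable:

```python
import torch

# Hypothetical CLIP-style similarity logits for five classes; "dog" and
# "ball" are both present in the image, but "dog" is more prominent.
logits = torch.tensor([8.0, 5.5, 1.0, 0.5, 0.2])  # dog, ball, car, tree, sky

# Softmax couples the classes: the prominent "dog" logit pushes the present
# "ball" class down to ~0.08, below any reasonable multi-label threshold.
softmax_scores = torch.softmax(logits, dim=0)

# An independent sigmoid (the 3.0 shift is an arbitrary decision boundary
# chosen for this toy example) scores each class on its own; "dog" and
# "ball" both stay above 0.5 while the absent classes fall below it.
sigmoid_scores = torch.sigmoid(logits - 3.0)

print(softmax_scores)  # tensor([0.9225, 0.0757, ...])
print(sigmoid_scores)  # tensor([0.9933, 0.9241, 0.1192, 0.0759, 0.0573])
```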
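The three-step pipeline itself can be sketched as follows. This is a minimal,
hypothetical rendering assuming pre-extracted patch features, class-name text
embeddings, and a self-attention map from a frozen CLIP; the refinement and
re-scoring steps are simplified stand-ins for the paper's DMAR and CWR modules
(see the linked repository for the actual implementation):

```python
import torch
import torch.nn.functional as F

def local_to_global_tags(patch_feats, text_embeds, attn, clip_global_score,
                         image, threshold=0.5):
    """Hypothetical sketch of the local-to-global tagging flow.

    patch_feats:       (N, D) patch tokens from the frozen image encoder
    text_embeds:       (C, D) text embeddings of the class names
    attn:              (N, N) self-attention map from a late encoder layer
    clip_global_score: callable(image, class_idx) -> scalar CLIP score
    """
    # Step 1: patch-level classification -> coarse per-class score maps.
    patch_feats = F.normalize(patch_feats, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    coarse = patch_feats @ text_embeds.T              # (N, C) cosine scores

    # Step 2: attention-based refinement (stand-in for DMAR): propagate
    # scores among mutually attending patches to smooth out noisy patches.
    refined = attn @ coarse                           # (N, C)

    # Aggregate local evidence: a class is a candidate when some region
    # supports it strongly (max over patches, not a global average, so a
    # dominant class cannot drown out smaller ones).
    image_scores = refined.max(dim=0).values          # (C,)
    candidates = (image_scores.sigmoid() > threshold).nonzero().flatten()

    # Step 3: class-wise re-identification (stand-in for CWR): re-score
    # each candidate with a global CLIP pass to reject spurious local hits.
    return [int(c) for c in candidates
            if clip_global_score(image, int(c)) > threshold]
```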
Related papers
- LayerMatch: Do Pseudo-labels Benefit All Layers? [77.59625180366115]
Semi-supervised learning offers a promising solution to mitigate the dependency on labeled data.
We develop two layer-specific pseudo-label strategies, termed Grad-ReLU and Avg-Clustering.
Our approach consistently demonstrates exceptional performance on standard semi-supervised learning benchmarks.
arXiv Detail & Related papers (2024-06-20T11:25:50Z)
- Learning Label Hierarchy with Supervised Contrastive Learning [8.488965459026678]
Supervised contrastive learning (SCL) frameworks treat each class as independent and thus consider all classes to be equally important.
This paper introduces a family of Label-Aware SCL methods (LASCL) that incorporate hierarchical information into SCL by leveraging similarities between classes.
Experiments on three datasets show that the proposed LASCL performs well on text classification, distinguishing a single label among multiple labels.
arXiv Detail & Related papers (2024-01-31T23:21:40Z)
- CDUL: CLIP-Driven Unsupervised Learning for Multi-Label Image Classification [23.392746466420128]
This paper presents a CLIP-based unsupervised learning method for annotation-free multi-label image classification.
We take full advantage of the powerful CLIP model and propose a novel approach to extend CLIP for multi-label predictions based on global-local image-text similarity aggregation.
Our method outperforms state-of-the-art unsupervised methods on the MS-COCO, PASCAL VOC 2007, PASCAL VOC 2012, and NUS-WIDE datasets.
arXiv Detail & Related papers (2023-07-31T13:12:02Z)
- Learning Disentangled Label Representations for Multi-label Classification [39.97251974500034]
The One-shared-Feature-for-Multiple-Labels (OFML) mechanism is not conducive to learning discriminative label features.
We introduce the One-specific-Feature-for-One-Label (OFOL) mechanism and propose a novel disentangled label feature learning framework.
We achieve state-of-the-art performance on eight datasets.
arXiv Detail & Related papers (2022-12-02T21:49:34Z)
- Scaling up Multi-domain Semantic Segmentation with Sentence Embeddings [81.09026586111811]
We propose an approach to semantic segmentation that achieves state-of-the-art supervised performance when applied in a zero-shot setting.
This is achieved by replacing each class label with a vector-valued embedding of a short paragraph that describes the class.
The resulting merged semantic segmentation dataset of over 2 million images enables training a model that achieves performance equal to that of state-of-the-art supervised methods on 7 benchmark datasets.
arXiv Detail & Related papers (2022-02-04T07:19:09Z)
- SCARF: Self-Supervised Contrastive Learning using Random Feature Corruption [72.35532598131176]
We propose SCARF, a technique for contrastive learning, where views are formed by corrupting a random subset of features.
We show that SCARF complements existing strategies and outperforms alternatives like autoencoders; a sketch of the view-corruption idea appears after this list.
arXiv Detail & Related papers (2021-06-29T08:08:33Z)
- Generative Multi-Label Zero-Shot Learning [136.17594611722285]
Multi-label zero-shot learning strives to classify images into multiple unseen categories for which no data is available during training.
Our work is the first to tackle the problem of multi-label feature synthesis in the (generalized) zero-shot setting.
Our cross-level fusion-based generative approach outperforms the state-of-the-art on all three datasets.
arXiv Detail & Related papers (2021-01-27T18:56:46Z)
- Joint Visual and Temporal Consistency for Unsupervised Domain Adaptive Person Re-Identification [64.37745443119942]
This paper jointly enforces visual and temporal consistency by combining a local one-hot classification and a global multi-class classification.
Experimental results on three large-scale ReID datasets demonstrate the superiority of the proposed method in both purely unsupervised and unsupervised domain adaptive ReID tasks.
arXiv Detail & Related papers (2020-07-21T14:31:27Z)
- Unsupervised Person Re-identification via Multi-label Classification [55.65870468861157]
This paper formulates unsupervised person ReID as a multi-label classification task to progressively seek true labels.
Our method starts by assigning each person image with a single-class label, then evolves to multi-label classification by leveraging the updated ReID model for label prediction.
To boost ReID model training efficiency in multi-label classification, we propose the memory-based multi-label classification loss (MMCL).
arXiv Detail & Related papers (2020-04-20T12:13:43Z)
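Returning to the SCARF entry above: the following hypothetical snippet
illustrates its core idea of building contrastive views by corrupting a random
subset of features, resampling corrupted entries from each feature's empirical
marginal (implemented here as a within-batch column shuffle). The function
name, corruption rate, and shuffle-based resampling are illustrative
assumptions, not the paper's released code:

```python
import torch

def scarf_views(x, corruption_rate=0.6):
    """Illustrative sketch of SCARF-style view construction for a tabular
    batch x of shape (B, F): the anchor view is the clean input, and the
    positive view corrupts a random subset of features by resampling them
    from their per-feature empirical marginal (a within-batch shuffle)."""
    B, F = x.shape
    mask = torch.rand(B, F) < corruption_rate           # entries to corrupt
    shuffled = torch.stack([col[torch.randperm(B)]      # marginal resampling
                            for col in x.T], dim=1)
    return x, torch.where(mask, shuffled, x)
```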