CDUL: CLIP-Driven Unsupervised Learning for Multi-Label Image Classification
- URL: http://arxiv.org/abs/2307.16634v2
- Date: Thu, 7 Mar 2024 05:05:15 GMT
- Title: CDUL: CLIP-Driven Unsupervised Learning for Multi-Label Image Classification
- Authors: Rabab Abdelfattah, Qing Guo, Xiaoguang Li, Xiaofeng Wang, and Song Wang
- Abstract summary: This paper presents a CLIP-based unsupervised learning method for annotation-free multi-label image classification.
We take full advantage of the powerful CLIP model and propose a novel approach to extend CLIP for multi-label predictions based on global-local image-text similarity aggregation.
Our method outperforms state-of-the-art unsupervised methods on MS-COCO, PASCAL VOC 2007, PASCAL VOC 2012, and NUS datasets.
- Score: 23.392746466420128
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a CLIP-based unsupervised learning method for
annotation-free multi-label image classification, including three stages:
initialization, training, and inference. At the initialization stage, we take
full advantage of the powerful CLIP model and propose a novel approach to
extend CLIP for multi-label predictions based on global-local image-text
similarity aggregation. To be more specific, we split each image into snippets
and leverage CLIP to generate the similarity vector for the whole image
(global) as well as each snippet (local). Then a similarity aggregator is
introduced to leverage the global and local similarity vectors. Using the
aggregated similarity scores as the initial pseudo labels at the training
stage, we propose an optimization framework to train the parameters of the
classification network and refine pseudo labels for unobserved labels. During
inference, only the classification network is used to predict the labels of the
input image. Extensive experiments show that our method outperforms
state-of-the-art unsupervised methods on MS-COCO, PASCAL VOC 2007, PASCAL VOC
2012, and NUS datasets and even achieves comparable results to weakly
supervised classification methods.
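As a reading aid, here is a minimal sketch of the initialization stage described above: CLIP scores the whole image and a grid of snippets against the label prompts, and the global and local similarity vectors are fused into soft pseudo labels. The prompt template, the 3x3 grid, and the max-then-average fusion rule are illustrative assumptions, not the paper's exact aggregator; the training-stage refinement of unobserved labels is not shown.

```python
# Sketch of the initialization stage: global-local CLIP similarity
# aggregation into soft pseudo labels. The max/average fusion rule below
# is an illustrative assumption, not the paper's exact aggregator.
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

labels = ["person", "dog", "bicycle", "car"]  # example label vocabulary
text = clip.tokenize([f"a photo of a {c}" for c in labels]).to(device)

def similarity_vector(pil_image):
    """Cosine-similarity vector between one image and all label prompts."""
    image = preprocess(pil_image).unsqueeze(0).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).squeeze(0)          # shape: (num_labels,)

def aggregated_pseudo_labels(pil_image, grid=3):
    """Score the whole image (global) plus a grid of snippets (local),
    then fuse the two similarity vectors into initial soft pseudo labels."""
    w, h = pil_image.size
    global_sim = similarity_vector(pil_image)
    local_sims = []
    for i in range(grid):
        for j in range(grid):
            box = (j * w // grid, i * h // grid,
                   (j + 1) * w // grid, (i + 1) * h // grid)
            local_sims.append(similarity_vector(pil_image.crop(box)))
    local_sim = torch.stack(local_sims).max(dim=0).values
    return 0.5 * (global_sim + local_sim)
```

Max-pooling over snippets is what lets small objects, which the global embedding tends to under-represent, surface in the pseudo labels.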
Related papers
- CLIP-Decoder : ZeroShot Multilabel Classification using Multimodal CLIP Aligned Representation [12.994898879803642]
The CLIP-Decoder is a novel method based on the state-of-the-art ML-Decoder attention-based head.
We introduce multi-modal representation learning in CLIP-Decoder, utilizing the text encoder to extract text features and the image encoder for image feature extraction.
Our method achieves an absolute performance increase of 3.9% over existing methods on zero-shot multi-label classification tasks.
arXiv Detail & Related papers (2024-06-21T02:19:26Z)
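A hedged sketch of the kind of head the CLIP-Decoder entry above describes: label queries derived from CLIP text embeddings cross-attend over image patch features, giving one logit per class. The single decoder layer, the dimensions, and the shared per-query projection are assumptions for illustration, not the ML-Decoder implementation.

```python
# Illustrative ML-Decoder-style multi-label head: label queries derived
# from text embeddings cross-attend over image patch features. The
# single decoder layer and all dimensions are assumptions for the sketch.
import torch
import torch.nn as nn

class TextQueryDecoderHead(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(),
                                 nn.Linear(dim * 4, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.to_logit = nn.Linear(dim, 1)  # one logit per label query

    def forward(self, patch_feats, label_embeds):
        # patch_feats:  (B, num_patches, dim) from the image encoder
        # label_embeds: (num_labels, dim) from the text encoder
        q = label_embeds.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        attn_out, _ = self.cross_attn(q, patch_feats, patch_feats)
        x = self.norm1(q + attn_out)
        x = self.norm2(x + self.ffn(x))
        return self.to_logit(x).squeeze(-1)  # (B, num_labels)

head = TextQueryDecoderHead()
logits = head(torch.randn(2, 49, 512), torch.randn(80, 512))
print(logits.shape)  # torch.Size([2, 80])
```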
- TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP Without Training [29.431698321195814]
Contrastive Language-Image Pre-training (CLIP) has demonstrated impressive capabilities in open-vocabulary classification.
CLIP shows poor performance on multi-label datasets because the global feature tends to be dominated by the most prominent class.
We propose a local-to-global framework to obtain image tags.
arXiv Detail & Related papers (2023-12-20T08:15:40Z)
- Generalized Category Discovery with Clustering Assignment Consistency [56.92546133591019]
Generalized category discovery (GCD) is a recently proposed open-world task.
We propose a co-training-based framework that encourages clustering consistency.
Our method achieves state-of-the-art performance on three generic benchmarks and three fine-grained visual recognition datasets.
arXiv Detail & Related papers (2023-10-30T00:32:47Z)
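For the Generalized Category Discovery entry above, one common way to encourage clustering consistency between co-trained views is sketched below: both augmentations are softly assigned to shared prototypes and each assignment supervises the other. The prototype parameterization, temperature, and symmetric cross-entropy are generic assumptions, not the paper's exact losses.

```python
# Generic clustering-assignment-consistency sketch: two augmented views
# are assigned to learnable prototypes, and each view's assignment is
# used as the target for the other. Temperature/shapes are assumptions.
import torch
import torch.nn.functional as F

def assignment(feats, prototypes, temp=0.1):
    feats = F.normalize(feats, dim=-1)
    protos = F.normalize(prototypes, dim=-1)
    return F.softmax(feats @ protos.T / temp, dim=-1)   # (B, K) soft assignments

def consistency_loss(feats_v1, feats_v2, prototypes):
    p1 = assignment(feats_v1, prototypes)
    p2 = assignment(feats_v2, prototypes)
    # symmetric cross-entropy; stop-gradient on the target side
    ce = lambda p, q: -(q.detach() * torch.log(p + 1e-8)).sum(-1).mean()
    return 0.5 * (ce(p1, p2) + ce(p2, p1))

prototypes = torch.nn.Parameter(torch.randn(100, 256))  # K=100 candidate clusters
loss = consistency_loss(torch.randn(32, 256), torch.randn(32, 256), prototypes)
```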
- Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment [53.2701026843921]
Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification.
In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary.
We propose the Self Structural Semantic Alignment (S3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning.
arXiv Detail & Related papers (2023-08-24T17:56:46Z)
- ISLE: A Framework for Image Level Semantic Segmentation Ensemble [5.137284292672375]
Conventional semantic segmentation networks require massive pixel-wise annotated labels to reach state-of-the-art prediction quality.
We propose ISLE, which ensembles the "pseudo-labels" produced by a set of different semantic segmentation techniques on a class-wise level.
We reach up to 2.4% improvement over ISLE's individual components.
arXiv Detail & Related papers (2023-03-14T13:36:36Z)
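A minimal sketch of class-wise pseudo-label ensembling in the spirit of the ISLE entry above: each class's score map is taken from the technique chosen for that class before a pixel-wise argmax. The per-class preference table and the argmax fusion are illustrative assumptions.

```python
# Class-wise pseudo-label ensembling sketch: each class's map is taken
# from the segmentation technique preferred for that class. The
# preference table and argmax fusion are assumptions for the sketch.
import numpy as np

def ensemble_pseudo_labels(score_maps, class_to_method):
    """score_maps: dict mapping method name -> (C, H, W) per-class scores.
    class_to_method: list of length C naming the method used per class."""
    methods = list(score_maps.keys())
    C, H, W = score_maps[methods[0]].shape
    fused = np.zeros((C, H, W), dtype=np.float32)
    for c in range(C):
        fused[c] = score_maps[class_to_method[c]][c]
    return fused.argmax(axis=0)      # (H, W) hard pseudo-label map

maps = {"cam": np.random.rand(21, 64, 64), "irn": np.random.rand(21, 64, 64)}
pick = ["cam" if c % 2 == 0 else "irn" for c in range(21)]  # toy preference table
pseudo = ensemble_pseudo_labels(maps, pick)
```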
- Masked Unsupervised Self-training for Zero-shot Image Classification [98.23094305347709]
Masked Unsupervised Self-Training (MUST) is a new approach which leverages two different and complementary sources of supervision: pseudo-labels and raw images.
MUST improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification.
arXiv Detail & Related papers (2022-06-07T02:03:06Z)
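The two supervision sources named in the MUST entry above can be pictured as a pseudo-label term on confident predictions plus a masked-patch reconstruction term on raw images. The confidence threshold, MSE reconstruction, and loss weighting below are illustrative assumptions, not the paper's settings.

```python
# Sketch of combining two supervision sources: confident pseudo-labels
# and a masked-image objective on raw pixels. Threshold, MSE form, and
# weighting are assumptions for the sketch.
import torch
import torch.nn.functional as F

def must_style_loss(student_logits, teacher_probs, recon, target_patches,
                    conf_thresh=0.7, w_mim=1.0):
    conf, pseudo = teacher_probs.max(dim=-1)
    mask = conf > conf_thresh                     # keep confident samples only
    cls_loss = (F.cross_entropy(student_logits, pseudo, reduction="none")
                * mask).sum() / mask.sum().clamp(min=1)
    mim_loss = F.mse_loss(recon, target_patches)  # masked patch reconstruction
    return cls_loss + w_mim * mim_loss

logits = torch.randn(16, 1000)
probs = torch.softmax(torch.randn(16, 1000), dim=-1)
loss = must_style_loss(logits, probs,
                       torch.randn(16, 196, 768), torch.randn(16, 196, 768))
```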
- Zero-Shot Recognition through Image-Guided Semantic Classification [9.291055558504588]
We present a new embedding-based framework for zero-shot learning (ZSL).
Motivated by the binary relevance method for multi-label classification, we propose to inversely learn the mapping between an image and a semantic classifier.
IGSC is conceptually simple and can be realized by a slight enhancement of an existing deep architecture for classification.
arXiv Detail & Related papers (2020-07-23T06:22:40Z)
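A sketch of the inverse mapping described in the IGSC entry above: instead of embedding the image into the label space, the network regresses a classifier from the image and applies it to class semantic embeddings (e.g., word vectors). The linear generator and all dimensions are assumptions.

```python
# Sketch of image-guided semantic classification: the image is mapped to
# the weights of a label classifier, which is then applied to class
# semantic embeddings. The linear generator is an assumption.
import torch
import torch.nn as nn

class IGSCHead(nn.Module):
    def __init__(self, img_dim=2048, sem_dim=300):
        super().__init__()
        self.to_classifier = nn.Linear(img_dim, sem_dim)  # image -> classifier weights

    def forward(self, img_feat, class_embeds):
        w = self.to_classifier(img_feat)   # (B, sem_dim), one classifier per image
        return w @ class_embeds.T          # (B, num_classes) compatibility scores

head = IGSCHead()
scores = head(torch.randn(4, 2048), torch.randn(50, 300))  # 50 unseen classes
```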
- Joint Visual and Temporal Consistency for Unsupervised Domain Adaptive Person Re-Identification [64.37745443119942]
This paper jointly enforces visual and temporal consistency by combining a local one-hot classification and a global multi-class classification.
Experimental results on three large-scale ReID datasets demonstrate the superiority of the proposed method on both unsupervised ReID and unsupervised domain adaptive ReID tasks.
arXiv Detail & Related papers (2020-07-21T14:31:27Z)
- Fine-Grained Visual Classification with Efficient End-to-end Localization [49.9887676289364]
We present an efficient localization module that can be fused with a classification network in an end-to-end setup.
We evaluate the new model on the three benchmark datasets CUB200-2011, Stanford Cars and FGVC-Aircraft.
arXiv Detail & Related papers (2020-05-11T14:07:06Z)
- Unsupervised Person Re-identification via Multi-label Classification [55.65870468861157]
This paper formulates unsupervised person ReID as a multi-label classification task to progressively seek true labels.
Our method starts by assigning each person image with a single-class label, then evolves to multi-label classification by leveraging the updated ReID model for label prediction.
To boost the ReID model training efficiency in multi-label classification, we propose the memory-based multi-label classification loss (MMCL).
arXiv Detail & Related papers (2020-04-20T12:13:43Z)
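A hedged sketch of a memory-based multi-label classification loss in the spirit of the MMCL entry above: each image is scored against a memory bank holding one feature slot per training image, and a binary loss is applied to its multi-label target vector. The cosine scoring, scale factor, and plain BCE form are illustrative simplifications, not the paper's exact loss.

```python
# Sketch of a memory-based multi-label classification loss: features are
# scored against a per-image memory bank and a binary loss is applied to
# the (evolving) multi-label targets. Scaling and BCE form are assumed.
import torch
import torch.nn.functional as F

def memory_multilabel_loss(feats, memory, labels, scale=10.0):
    """feats: (B, D) current features; memory: (N, D) one slot per image;
    labels: (B, N) binary multi-label targets over all N training images."""
    feats = F.normalize(feats, dim=-1)
    memory = F.normalize(memory, dim=-1)
    scores = scale * feats @ memory.T      # (B, N) similarity logits
    return F.binary_cross_entropy_with_logits(scores, labels.float())

memory = torch.randn(1000, 256)            # memory bank, N=1000 images
targets = torch.zeros(8, 1000).scatter_(1, torch.randint(0, 1000, (8, 3)), 1)
loss = memory_multilabel_loss(torch.randn(8, 256), memory, targets)
```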