CLIP-GCD: Simple Language Guided Generalized Category Discovery
- URL: http://arxiv.org/abs/2305.10420v1
- Date: Wed, 17 May 2023 17:55:33 GMT
- Title: CLIP-GCD: Simple Language Guided Generalized Category Discovery
- Authors: Rabah Ouldnoughi, Chia-Wen Kuo, Zsolt Kira
- Abstract summary: Generalized Category Discovery (GCD) requires a model to both classify known categories and cluster unknown categories in unlabeled data.
Prior methods leveraged self-supervised pre-training combined with supervised fine-tuning on the labeled data, followed by simple clustering methods.
We propose to leverage multi-modal (vision and language) models, in two complementary ways.
- Score: 21.778676607030253
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generalized Category Discovery (GCD) requires a model to both classify known
categories and cluster unknown categories in unlabeled data. Prior methods
leveraged self-supervised pre-training combined with supervised fine-tuning on
the labeled data, followed by simple clustering methods. In this paper, we
posit that such methods are still prone to poor performance on
out-of-distribution categories, and do not leverage a key ingredient: Semantic
relationships between object categories. We therefore propose to leverage
multi-modal (vision and language) models, in two complementary ways. First, we
establish a strong baseline by replacing uni-modal features with CLIP, inspired
by its zero-shot performance. Second, we propose a novel retrieval-based
mechanism that leverages CLIP's aligned vision-language representations by
mining text descriptions from a text corpus for the labeled and unlabeled set.
We specifically use the alignment between CLIP's visual encoding of the image
and textual encoding of the corpus to retrieve top-k relevant pieces of text
and incorporate their embeddings to perform joint image+text semi-supervised
clustering. We perform rigorous experimentation and ablations (including on
where to retrieve from, how much to retrieve, and how to combine information),
and validate our results on several datasets including out-of-distribution
domains, demonstrating state-of-the-art results.
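As a rough illustration of the retrieval step, the sketch below (not the authors' code) encodes images and a toy text corpus with OpenAI's CLIP package, retrieves the top-k most aligned corpus texts per image, and clusters the concatenated image + mean-text features. Plain k-means stands in for the paper's semi-supervised clustering, and the corpus, k, and fusion scheme are placeholders for exactly the choices the paper ablates.

```python
# Minimal sketch of the retrieval-augmented clustering idea (not the authors' code).
# Assumptions: OpenAI's `clip` package, scikit-learn k-means in place of the paper's
# semi-supervised k-means, and a toy in-memory corpus standing in for the real one.
import clip
import numpy as np
import torch
from PIL import Image
from sklearn.cluster import KMeans

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

# Hypothetical text corpus; the paper mines a large external corpus instead.
corpus = [
    "a photo of a golden retriever",
    "a photo of a tabby cat",
    "a photo of a sports car",
    "a photo of a passenger airplane",
]
with torch.no_grad():
    corpus_emb = model.encode_text(clip.tokenize(corpus).to(device)).float()
corpus_emb = corpus_emb / corpus_emb.norm(dim=-1, keepdim=True)


def joint_feature(image: Image.Image, k: int = 2) -> np.ndarray:
    """Encode the image, retrieve its top-k corpus texts via CLIP alignment,
    and return the concatenated image + mean-text feature used for clustering."""
    with torch.no_grad():
        img = model.encode_image(preprocess(image).unsqueeze(0).to(device)).float()
    img = img / img.norm(dim=-1, keepdim=True)
    sims = img @ corpus_emb.T                       # cosine similarity (both normalized)
    topk = sims.topk(k, dim=-1).indices.squeeze(0)  # k most aligned corpus entries
    text = corpus_emb[topk].mean(dim=0, keepdim=True)
    return torch.cat([img, text], dim=-1).squeeze(0).cpu().numpy()


# Dummy images only so the sketch runs end to end; real GCD would use the
# labelled + unlabelled training images and semi-supervised k-means.
images = [Image.fromarray(np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8))
          for _ in range(8)]
feats = np.stack([joint_feature(im) for im in images])
labels = KMeans(n_clusters=2, n_init=10).fit_predict(feats)
print(labels)
```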
Related papers
- African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification [53.89380284760555]
FOCI (Fine-grained Object ClassIfication) is a difficult multiple-choice benchmark for fine-grained object classification.
FOCI complements five popular classification datasets with four domain-specific subsets from ImageNet-21k.
arXiv Detail & Related papers (2024-06-20T16:59:39Z)
- Unlocking the Multi-modal Potential of CLIP for Generalized Category Discovery [50.564146730579424]
We propose a Text Embedding Synthesizer (TES) to generate pseudo text embeddings for unlabelled samples.
Our method unlocks the multi-modal potentials of CLIP and outperforms the baseline methods by a large margin on all GCD benchmarks.
arXiv Detail & Related papers (2024-03-15T02:40:13Z)
- Textual Knowledge Matters: Cross-Modality Co-Teaching for Generalized Visual Class Discovery [69.91441987063307]
Generalized Category Discovery (GCD) aims to cluster unlabeled data from both known and unknown categories.
Current GCD methods rely on only visual cues, which neglect the multi-modality perceptive nature of human cognitive processes in discovering novel visual categories.
We propose a two-phase TextGCD framework to accomplish multi-modality GCD by exploiting powerful Visual-Language Models.
arXiv Detail & Related papers (2024-03-12T07:06:50Z)
- Generalized Category Discovery with Clustering Assignment Consistency [56.92546133591019]
Generalized category discovery (GCD) is a recently proposed open-world task.
We propose a co-training-based framework that encourages clustering consistency.
Our method achieves state-of-the-art performance on three generic benchmarks and three fine-grained visual recognition datasets.
arXiv Detail & Related papers (2023-10-30T00:32:47Z)
- CEIL: A General Classification-Enhanced Iterative Learning Framework for Text Clustering [16.08402937918212]
We propose a novel Classification-Enhanced Iterative Learning framework for short text clustering.
In each iteration, we first adopt a language model to retrieve the initial text representations.
After strict data filtering and aggregation processes, samples with clean category labels are retrieved, which serve as supervision information.
Finally, the updated language model with improved representation ability is used to enhance clustering in the next iteration.
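A toy analogue of this classify-then-filter loop is sketched below; it is not the CEIL implementation. TF-IDF features and a logistic-regression classifier stand in for the fine-tuned language model, and the "strict filtering" step is approximated by keeping only the sample closest to each centroid as a clean pseudo-labelled example.

```python
# Toy analogue of an iterative classify-then-filter clustering loop (not CEIL itself).
# TF-IDF + logistic regression stand in for the fine-tuned language model.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = [  # tiny hypothetical corpus so the sketch runs end to end
    "stock markets rally on rate cut", "central bank lowers interest rates",
    "team wins championship final", "star striker scores twice in derby",
    "new smartphone released this week", "chip maker unveils faster processor",
]
n_clusters = 3
feats = TfidfVectorizer().fit_transform(texts).toarray()

for it in range(3):
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(feats)
    dist_to_own = km.transform(feats)[np.arange(len(texts)), km.labels_]
    # "Strict filtering": keep only the sample closest to each centroid as clean.
    keep = np.array([np.argmin(np.where(km.labels_ == c, dist_to_own, np.inf))
                     for c in range(n_clusters)])
    clf = LogisticRegression(max_iter=1000).fit(feats[keep], km.labels_[keep])
    # CEIL fine-tunes the language model on such pseudo-labels; here we simply
    # append the classifier's class probabilities as extra features for the next round.
    feats = np.hstack([feats, clf.predict_proba(feats)])

print(km.labels_)
```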
arXiv Detail & Related papers (2023-04-20T14:04:31Z)
- CiPR: An Efficient Framework with Cross-instance Positive Relations for Generalized Category Discovery [21.380021266251426]
Generalized category discovery (GCD) considers the open-world problem of automatically clustering a partially labelled dataset.
In this paper, we address the GCD problem with an unknown category number for the unlabelled data.
We propose a framework, named CiPR, to bootstrap the representation by exploiting Cross-instance Positive Relations.
arXiv Detail & Related papers (2023-04-14T05:25:52Z)
- Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding [95.78002228538841]
We propose a new open-world semantic segmentation pipeline that makes the first attempt to learn to segment semantic objects of various open-world categories without any efforts on dense annotations.
Our method can directly segment objects of arbitrary categories, outperforming zero-shot segmentation methods that require data labeling on three benchmark datasets.
arXiv Detail & Related papers (2022-07-18T09:20:04Z)
- UnifieR: A Unified Retriever for Large-Scale Retrieval [84.61239936314597]
Large-scale retrieval is to recall relevant documents from a huge collection given a query.
Recent retrieval methods based on pre-trained language models (PLM) can be coarsely categorized into either dense-vector or lexicon-based paradigms.
We propose a new learning framework, UnifieR which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability.
arXiv Detail & Related papers (2022-05-23T11:01:59Z)
- DocSCAN: Unsupervised Text Classification via Learning from Neighbors [2.2082422928825145]
We introduce DocSCAN, a completely unsupervised text classification approach using Semantic Clustering by Adopting Nearest-Neighbors (SCAN).
For each document, we obtain semantically informative vectors from a large pre-trained language model. Similar documents have proximate vectors, so neighbors in the representation space tend to share topic labels.
Our learnable clustering approach uses pairs of neighboring datapoints as a weak learning signal. The proposed approach learns to assign classes to the whole dataset without provided ground-truth labels.
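A minimal sketch of this SCAN-style objective follows, under the assumption of sentence-transformers embeddings and a single linear clustering head (the real method uses a large pretrained language model and a more careful training setup): neighbour pairs are mined once in embedding space, then a small head is trained so that neighbours receive consistent class probabilities while an entropy term keeps clusters balanced.

```python
# Minimal sketch of a SCAN-style neighbour-consistency objective (not the DocSCAN code).
# Assumptions: sentence-transformers embeddings and a single linear clustering head.
import torch
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors

docs = ["markets fall after rate decision", "fed raises interest rates",
        "local team wins the cup", "striker scores a hat trick",
        "new gpu announced by chip maker", "laptop review praises faster processor"]
n_classes, k = 3, 2

encoder = SentenceTransformer("all-MiniLM-L6-v2")      # stand-in for the large LM
emb_np = encoder.encode(docs, normalize_embeddings=True)

# Mine each document's k nearest neighbours in embedding space (drop the point itself).
nn_idx = NearestNeighbors(n_neighbors=k + 1).fit(emb_np).kneighbors(
    emb_np, return_distance=False)[:, 1:]

emb = torch.tensor(emb_np)
neighbors = torch.tensor(nn_idx, dtype=torch.long)
head = torch.nn.Linear(emb.shape[1], n_classes)
opt = torch.optim.Adam(head.parameters(), lr=1e-2)

for step in range(200):
    p = F.softmax(head(emb), dim=1)                    # class probabilities per document
    p_nb = p[neighbors]                                # probabilities of each doc's neighbours
    consistency = -torch.log((p.unsqueeze(1) * p_nb).sum(-1) + 1e-8).mean()
    mean_p = p.mean(0)
    balance = (mean_p * torch.log(mean_p + 1e-8)).sum()  # negative entropy, spreads clusters
    loss = consistency + 5.0 * balance
    opt.zero_grad()
    loss.backward()
    opt.step()

print(F.softmax(head(emb), dim=1).argmax(dim=1))       # cluster assignment per document
```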
arXiv Detail & Related papers (2021-05-09T21:20:31Z)