Textual Knowledge Matters: Cross-Modality Co-Teaching for Generalized
Visual Class Discovery
- URL: http://arxiv.org/abs/2403.07369v1
- Date: Tue, 12 Mar 2024 07:06:50 GMT
- Title: Textual Knowledge Matters: Cross-Modality Co-Teaching for Generalized
Visual Class Discovery
- Authors: Haiyang Zheng, Nan Pu, Wenjing Li, Nicu Sebe, Zhun Zhong
- Abstract summary: Generalized Category Discovery (GCD) aims to cluster unlabeled data from both known and unknown categories.
Current GCD methods rely only on visual cues, neglecting the multi-modality perceptive nature of human cognitive processes in discovering novel visual categories.
We propose a two-phase TextGCD framework to accomplish multi-modality GCD by exploiting powerful Visual-Language Models.
- Score: 69.91441987063307
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we study the problem of Generalized Category Discovery (GCD),
which aims to cluster unlabeled data from both known and unknown categories
using the knowledge of labeled data from known categories. Current GCD methods
rely only on visual cues, which neglects the multi-modality perceptive nature
of human cognitive processes in discovering novel visual categories. To
address this, we propose a two-phase TextGCD framework to accomplish
multi-modality GCD by exploiting powerful Visual-Language Models. TextGCD
mainly includes a retrieval-based text generation (RTG) phase and a
cross-modality co-teaching (CCT) phase. First, RTG constructs a visual lexicon
using category tags from diverse datasets and attributes from Large Language
Models, generating descriptive texts for images in a retrieval manner. Second,
CCT leverages disparities between textual and visual modalities to foster
mutual learning, thereby enhancing visual GCD. In addition, we design an
adaptive class aligning strategy to ensure the alignment of category
perceptions between modalities as well as a soft-voting mechanism to integrate
multi-modality cues. Experiments on eight datasets show the large superiority
of our approach over state-of-the-art methods. Notably, our approach
outperforms the best competitor by 7.7% and 10.8% in All accuracy on
ImageNet-1k and CUB, respectively.
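To make the two phases concrete, below is a minimal sketch (not the authors' released code) of the two mechanisms the abstract names: RTG-style retrieval of descriptive text from a visual lexicon, and soft-voting fusion of visual and textual predictions. The lexicon contents, the top-k value, the fusion weight, and the random vectors standing in for CLIP embeddings are all illustrative assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # L2-normalize so that dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def rtg_retrieve(image_embs, lexicon_embs, lexicon_texts, top_k=3):
    """RTG-style step: for each image, retrieve the top-k lexicon entries
    (category tags / LLM-generated attributes) by cosine similarity and
    join them into a descriptive text. top_k is an assumed hyperparameter."""
    sims = l2_normalize(image_embs) @ l2_normalize(lexicon_embs).T  # (N, L)
    top = np.argsort(-sims, axis=1)[:, :top_k]                      # (N, top_k)
    return [", ".join(lexicon_texts[j] for j in row) for row in top]

def soft_vote(visual_logits, textual_logits, visual_weight=0.5):
    """Soft-voting: average the class-probability distributions predicted
    from the two modalities. The 0.5 weight is an assumption, not a value
    taken from the paper."""
    def softmax(z):
        z = z - z.max(axis=1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)
    probs = (visual_weight * softmax(visual_logits)
             + (1.0 - visual_weight) * softmax(textual_logits))
    return probs.argmax(axis=1)

# Toy usage: random vectors stand in for CLIP image/text embeddings.
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(4, 512))
lexicon_embs = rng.normal(size=(6, 512))
lexicon_texts = ["sparrow", "red wing", "hooked beak",
                 "terrier", "short fur", "pointed ears"]
print(rtg_retrieve(image_embs, lexicon_embs, lexicon_texts))
print(soft_vote(rng.normal(size=(4, 10)), rng.normal(size=(4, 10))))
```

In the full framework, the retrieved texts would feed a textual classifier that is co-trained with the visual one during the CCT phase; the fixed fusion weight here is only a stand-in for whatever weighting scheme the paper actually uses.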
Related papers
- Dual-Level Cross-Modal Contrastive Clustering [4.083185193413678]
We propose a novel image clustering framework, named Dual-level Cross-Modal Contrastive Clustering (DXMC).
External textual information is introduced to construct a semantic space, which is adopted to generate image-text pairs.
The image-text pairs are sent to pre-trained image and text encoders to obtain image and text embeddings, which are subsequently fed into four well-designed networks.
arXiv Detail & Related papers (2024-09-06T18:49:45Z)
- Contextuality Helps Representation Learning for Generalized Category Discovery [5.885208652383516]
This paper introduces a novel approach to Generalized Category Discovery (GCD) by leveraging the concept of contextuality.
Our model integrates two levels of contextuality: instance-level, where nearest-neighbor contexts are utilized for contrastive learning, and cluster-level, which likewise employs contrastive learning.
Integrating this contextual information effectively improves feature learning and thereby the classification accuracy across all categories.
arXiv Detail & Related papers (2024-07-29T07:30:41Z)
- Unlocking the Multi-modal Potential of CLIP for Generalized Category Discovery [50.564146730579424]
We propose a Text Embedding Synthesizer (TES) to generate pseudo text embeddings for unlabelled samples.
Our method unlocks the multi-modal potential of CLIP and outperforms the baseline methods by a large margin on all GCD benchmarks.
arXiv Detail & Related papers (2024-03-15T02:40:13Z)
- Generalized Category Discovery with Clustering Assignment Consistency [56.92546133591019]
Generalized category discovery (GCD) is a recently proposed open-world task.
We propose a co-training-based framework that encourages clustering consistency.
Our method achieves state-of-the-art performance on three generic benchmarks and three fine-grained visual recognition datasets.
arXiv Detail & Related papers (2023-10-30T00:32:47Z)
- CLIP-GCD: Simple Language Guided Generalized Category Discovery [21.778676607030253]
Generalized Category Discovery (GCD) requires a model to both classify known categories and cluster unknown categories in unlabeled data.
Prior methods leveraged self-supervised pre-training combined with supervised fine-tuning on the labeled data, followed by simple clustering methods.
We propose to leverage multi-modal (vision and language) models, in two complementary ways.
arXiv Detail & Related papers (2023-05-17T17:55:33Z)
- Dynamic Conceptional Contrastive Learning for Generalized Category Discovery [76.82327473338734]
Generalized category discovery (GCD) aims to automatically cluster partially labeled data.
Unlabeled data contain instances that are not only from known categories of the labeled data but also from novel categories.
One effective approach to GCD is applying self-supervised learning to learn discriminative representations for unlabeled data.
We propose a Dynamic Conceptional Contrastive Learning framework, which can effectively improve clustering accuracy.
arXiv Detail & Related papers (2023-03-30T14:04:39Z)
- Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding [95.78002228538841]
We propose a new open-world semantic segmentation pipeline that makes the first attempt to learn to segment semantic objects of various open-world categories without any effort on dense annotations.
On three benchmark datasets, our method can directly segment objects of arbitrary categories, outperforming zero-shot segmentation methods that require data labeling.
arXiv Detail & Related papers (2022-07-18T09:20:04Z) - Semantic Representation and Dependency Learning for Multi-Label Image
Recognition [76.52120002993728]
We propose a novel and effective semantic representation and dependency learning (SRDL) framework to learn category-specific semantic representation for each category.
Specifically, we design a category-specific attentional regions (CAR) module to generate channel- and spatial-wise attention matrices to guide the model.
We also design an object erasing (OE) module to implicitly learn semantic dependency among categories by erasing semantic-aware regions.
arXiv Detail & Related papers (2022-04-08T00:55:15Z)