SpectralGCD: Spectral Concept Selection and Cross-modal Representation Learning for Generalized Category Discovery
- URL: http://arxiv.org/abs/2602.17395v1
- Date: Thu, 19 Feb 2026 14:18:50 GMT
- Title: SpectralGCD: Spectral Concept Selection and Cross-modal Representation Learning for Generalized Category Discovery
- Authors: Lorenzo Caselli, Marco Mistretta, Simone Magistri, Andrew D. Bagdanov,
- Abstract summary: Generalized Category Discovery (GCD) aims to identify novel categories in unlabeled data while leveraging a small labeled subset of known classes.<n>We propose SpectralGCD, an efficient and effective multimodal approach to GCD that uses CLIP cross-modal image-concept similarities as a unified cross-modal representation.<n>SpectralGCD delivers accuracy comparable to or significantly superior to state-of-the-art methods at a fraction of the computational cost.
- Score: 14.526295398233747
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generalized Category Discovery (GCD) aims to identify novel categories in unlabeled data while leveraging a small labeled subset of known classes. Training a parametric classifier solely on image features often leads to overfitting to old classes, and recent multimodal approaches improve performance by incorporating textual information. However, they treat modalities independently and incur high computational cost. We propose SpectralGCD, an efficient and effective multimodal approach to GCD that uses CLIP cross-modal image-concept similarities as a unified cross-modal representation. Each image is expressed as a mixture over semantic concepts from a large task-agnostic dictionary, which anchors learning to explicit semantics and reduces reliance on spurious visual cues. To maintain the semantic quality of representations learned by an efficient student, we introduce Spectral Filtering which exploits a cross-modal covariance matrix over the softmaxed similarities measured by a strong teacher model to automatically retain only relevant concepts from the dictionary. Forward and reverse knowledge distillation from the same teacher ensures that the cross-modal representations of the student remain both semantically sufficient and well-aligned. Across six benchmarks, SpectralGCD delivers accuracy comparable to or significantly superior to state-of-the-art methods at a fraction of the computational cost. The code is publicly available at: https://github.com/miccunifi/SpectralGCD.
Related papers
- Delving into Spectral Clustering with Vision-Language Representations [27.433418706301477]
We propose Neural Tangent Kernel Spectral Clustering that leverages cross-modal alignment in pre-trained vision-language models.<n>We show that this formulation amplifies within-cluster connections while suppressing spurious ones across clusters.<n>Our method consistently outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2026-02-10T09:36:24Z) - Unleashing the Power of Vision-Language Models for Long-Tailed Multi-Label Visual Recognition [55.189113121465816]
We propose a novel correlation adaptation prompt network (CAPNET) for long-tailed multi-label visual recognition.<n>CAPNET explicitly models correlations from CLIP's textual encoder.<n>It improves generalization through test-time ensembling and realigns visual-textual modalities.
arXiv Detail & Related papers (2025-11-25T18:57:28Z) - Hierarchical Identity Learning for Unsupervised Visible-Infrared Person Re-Identification [81.3063589622217]
Unsupervised visible-infrared person re-identification (USVI-ReID) aims to learn modality-invariant image features from unlabeled cross-modal person datasets.
arXiv Detail & Related papers (2025-09-15T05:10:43Z) - The Demon is in Ambiguity: Revisiting Situation Recognition with Single Positive Multi-Label Learning [30.485929387603463]
Context recognition is a fundamental task in computer vision that aims to extract structured semantic summaries from images.<n>Existing methods treat verb classification as a single-label problem, but we show through a comprehensive analysis that this formulation fails to address the inherent ambiguity in visual event recognition.<n>This paper makes three key contributions: First, we reveal through empirical analysis that verb classification is inherently a multi-label problem due to the ubiquitous semantic overlap between verb categories.<n>Second, given the impracticality of fully annotating large-scale datasets with multiple labels, we propose to reformulate verb classification as a single positive multi-label learning
arXiv Detail & Related papers (2025-08-29T17:51:55Z) - Semantic-Aware Representation Learning via Conditional Transport for Multi-Label Image Classification [8.864897133482907]
This paper proposes a novel approach named Semantic-aware representation learning via Conditional Transport for Multi-Label Image Classification (SCT)<n>The proposed method introduces a semantic-related feature learning module that extracts discriminative label-specific features by emphasizing semantic relevance and interaction.<n>Experiments on two widely-used benchmark datasets, VOC2007 and MS-COCO, validate the effectiveness of SCT and demonstrate its superior performance compared to existing state-of-the-art methods.
arXiv Detail & Related papers (2025-07-20T11:15:24Z) - GSSF: Generalized Structural Sparse Function for Deep Cross-modal Metric Learning [51.677086019209554]
We propose a Generalized Structural Sparse to capture powerful relationships across modalities for pair-wise similarity learning.
The distance metric delicately encapsulates two formats of diagonal and block-diagonal terms.
Experiments on cross-modal and two extra uni-modal retrieval tasks have validated its superiority and flexibility.
arXiv Detail & Related papers (2024-10-20T03:45:50Z) - Fine-grained Recognition with Learnable Semantic Data Augmentation [68.48892326854494]
Fine-grained image recognition is a longstanding computer vision challenge.
We propose diversifying the training data at the feature-level to alleviate the discriminative region loss problem.
Our method significantly improves the generalization performance on several popular classification networks.
arXiv Detail & Related papers (2023-09-01T11:15:50Z) - Towards Realistic Zero-Shot Classification via Self Structural Semantic
Alignment [53.2701026843921]
Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification.
In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary.
We propose the Self Structural Semantic Alignment (S3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning.
arXiv Detail & Related papers (2023-08-24T17:56:46Z) - Self-supervised Contrastive Learning for Cross-domain Hyperspectral
Image Representation [26.610588734000316]
This paper introduces a self-supervised learning framework suitable for hyperspectral images that are inherently challenging to annotate.
The proposed framework architecture leverages cross-domain CNN, allowing for learning representations from different hyperspectral images.
The experimental results demonstrate the advantage of the proposed self-supervised representation over models trained from scratch or other transfer learning methods.
arXiv Detail & Related papers (2022-02-08T16:16:45Z) - Seed the Views: Hierarchical Semantic Alignment for Contrastive
Representation Learning [116.91819311885166]
We propose a hierarchical semantic alignment strategy via expanding the views generated by a single image to textbfCross-samples and Multi-level representation.
Our method, termed as CsMl, has the ability to integrate multi-level visual representations across samples in a robust way.
arXiv Detail & Related papers (2020-12-04T17:26:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.