Related papers: Multi-Modal Representation Learning via Semi-Supervised Rate Reduction for Generalized Category Discovery

Multi-Modal Representation Learning via Semi-Supervised Rate Reduction for Generalized Category Discovery

URL: http://arxiv.org/abs/2602.19910v1
Date: Mon, 23 Feb 2026 14:51:09 GMT
Title: Multi-Modal Representation Learning via Semi-Supervised Rate Reduction for Generalized Category Discovery
Authors: Wei He, Xianghan Meng, Zhiyuan Huang, Xianbiao Qi, Rong Xiao, Chun-Guang Li,
Abstract summary: Generalized Category Discovery (GCD) aims to identify both known and unknown categories.<n>We propose a novel and effective multi-modal representation learning framework for GCD via Semi-Supervised Rate Reduction.<n>We conduct extensive experiments on generic and fine-grained benchmark datasets demonstrating superior performance of our approach.
Score: 15.933337984000346
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Generalized Category Discovery (GCD) aims to identify both known and unknown categories, with only partial labels given for the known categories, posing a challenging open-set recognition problem. State-of-the-art approaches for GCD task are usually built on multi-modality representation learning, which is heavily dependent upon inter-modality alignment. However, few of them cast a proper intra-modality alignment to generate a desired underlying structure of representation distributions. In this paper, we propose a novel and effective multi-modal representation learning framework for GCD via Semi-Supervised Rate Reduction, called SSR$^2$-GCD, to learn cross-modality representations with desired structural properties based on emphasizing to properly align intra-modality relationships. Moreover, to boost knowledge transfer, we integrate prompt candidates by leveraging the inter-modal alignment offered by Vision Language Models. We conduct extensive experiments on generic and fine-grained benchmark datasets demonstrating superior performance of our approach.

Related papers

Toward Effective Multimodal Graph Foundation Model: A Divide-and-Conquer Based Approach [42.970648490410504]
Multimodal Graph Foundation Models (MGFMs) allow for leveraging the rich multimodal information in Multimodal-Attributed Graphs (MAGs)<n>We propose PLANET, a novel framework employing a Divide-and-Conquer strategy to decouple modality interaction and alignment across distinct granularities.<n>We show that PLANET significantly outperforms state-of-the-art baselines across diverse graph-centric and multimodal generative tasks.
arXiv Detail & Related papers (2026-02-04T01:05:12Z)
Multi-Aspect Cross-modal Quantization for Generative Recommendation [27.92632297542123]
We propose Multi-Aspect Cross-modal quantization for generative Recommendation (MACRec)<n>We first introduce cross-modal quantization during the ID learning process, which effectively reduces conflict rates.<n>We also incorporate multi-aspect cross-modal alignments, including the implicit and explicit alignments.
arXiv Detail & Related papers (2025-11-19T04:55:14Z)
UniAlignment: Semantic Alignment for Unified Image Generation, Understanding, Manipulation and Perception [54.53657134205492]
UniAlignment is a unified multimodal generation framework within a single diffusion transformer.<n>It incorporates both intrinsic-modal semantic alignment and cross-modal semantic alignment, thereby enhancing the model's cross-modal consistency and instruction-following robustness.<n>We present SemGen-Bench, a new benchmark specifically designed to evaluate multimodal semantic consistency under complex textual instructions.
arXiv Detail & Related papers (2025-09-28T09:11:30Z)
DART: Dual Adaptive Refinement Transfer for Open-Vocabulary Multi-Label Recognition [59.203152078315235]
Open-Vocabulary Multi-Label Recognition (OV-MLR) aims to identify multiple seen and unseen object categories within an image.<n> Vision-Language Pre-training models offer a strong open-vocabulary foundation, but struggle with fine-grained localization under weak supervision.<n>We propose the Dual Adaptive Refinement Transfer (DART) framework to overcome these limitations.
arXiv Detail & Related papers (2025-08-07T17:22:33Z)
DiSa: Directional Saliency-Aware Prompt Learning for Generalizable Vision-Language Models [5.027492394254859]
DiSa is a Directional Saliency-Aware Prompt Learning framework.<n>It integrates two complementary regularization strategies to enhance generalization.<n>It consistently outperforms state-of-the-art prompt learning methods across various settings.
arXiv Detail & Related papers (2025-05-26T00:14:52Z)
Category-Adaptive Cross-Modal Semantic Refinement and Transfer for Open-Vocabulary Multi-Label Recognition [59.203152078315235]
We propose a novel category-adaptive cross-modal semantic refinement and transfer (C$2$SRT) framework to explore the semantic correlation.<n>The proposed framework consists of two complementary modules, i.e., intra-category semantic refinement (ISR) module and inter-category semantic transfer (IST) module.<n>Experiments on OV-MLR benchmarks clearly demonstrate that the proposed C$2$SRT framework outperforms current state-of-the-art algorithms.
arXiv Detail & Related papers (2024-12-09T04:00:18Z)
Textual Knowledge Matters: Cross-Modality Co-Teaching for Generalized Visual Class Discovery [65.16724941038052]
Generalized Category Discovery (GCD) aims to cluster unlabeled data from both known and unknown categories.<n>Current GCD methods rely on only visual cues, which neglect the multi-modality perceptive nature of human cognitive processes in discovering novel visual categories.<n>We propose a two-phase TextGCD framework to accomplish multi-modality GCD by exploiting powerful Visual-Language Models.
arXiv Detail & Related papers (2024-03-12T07:06:50Z)
Robust Saliency-Aware Distillation for Few-shot Fine-grained Visual Recognition [57.08108545219043]
Recognizing novel sub-categories with scarce samples is an essential and challenging research topic in computer vision. Existing literature addresses this challenge by employing local-based representation approaches. This article proposes a novel model, Robust Saliency-aware Distillation (RSaD), for few-shot fine-grained visual recognition.
arXiv Detail & Related papers (2023-05-12T00:13:17Z)
Semantic Representation and Dependency Learning for Multi-Label Image Recognition [76.52120002993728]
We propose a novel and effective semantic representation and dependency learning (SRDL) framework to learn category-specific semantic representation for each category. Specifically, we design a category-specific attentional regions (CAR) module to generate channel/spatial-wise attention matrices to guide model. We also design an object erasing (OE) module to implicitly learn semantic dependency among categories by erasing semantic-aware regions.
arXiv Detail & Related papers (2022-04-08T00:55:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.