Video-based Generalized Category Discovery via Memory-Guided Consistency-Aware Contrastive Learning
- URL: http://arxiv.org/abs/2509.06306v1
- Date: Mon, 08 Sep 2025 03:12:57 GMT
- Title: Video-based Generalized Category Discovery via Memory-Guided Consistency-Aware Contrastive Learning
- Authors: Zhang Jing, Pu Nan, Xie Yu Xiang, Guo Yanming, Lu Qianqi, Zou Shiwei, Yan Jie, Chen Yan
- Abstract summary: Generalized Category Discovery (GCD) is an emerging and challenging open-world problem.
Most existing GCD methods focus on discovering categories in static images.
We extend the GCD problem to the video domain and introduce a new setting, termed Video-GCD.
- Score: 3.7666592096735587
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generalized Category Discovery (GCD) is an emerging and challenging open-world problem that has garnered increasing attention in recent years. Most existing GCD methods focus on discovering categories in static images. However, relying solely on static visual content is often insufficient to reliably discover novel categories. To bridge this gap, we extend the GCD problem to the video domain and introduce a new setting, termed Video-GCD. In this setting, effectively integrating multi-perspective information across time is crucial for accurate Video-GCD. To tackle this challenge, we propose a novel Memory-guided Consistency-aware Contrastive Learning (MCCL) framework, which explicitly captures temporal-spatial cues and incorporates them into contrastive learning through a consistency-guided voting mechanism. MCCL consists of two core components: Consistency-Aware Contrastive Learning (CACL) and Memory-Guided Representation Enhancement (MGRE). CACL exploits multi-perspective temporal features to estimate consistency scores between unlabeled instances, which are then used to weight the contrastive loss accordingly. MGRE introduces a dual-level memory buffer that maintains both feature-level and logit-level representations, providing global context to enhance intra-class compactness and inter-class separability. This in turn refines the consistency estimation in CACL, forming a mutually reinforcing feedback loop between representation learning and consistency modeling. To facilitate a comprehensive evaluation, we construct a new and challenging Video-GCD benchmark, which includes action recognition and bird classification video datasets. Extensive experiments demonstrate that our method significantly outperforms competitive GCD approaches adapted from image-based settings, highlighting the importance of temporal information for discovering novel categories in videos. The code will be publicly available.
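As a rough illustration of how CACL's consistency weighting and MGRE's dual-level buffer could fit together, here is a minimal PyTorch sketch. The paper's code is not yet released, so every name, the FIFO buffer policy, the InfoNCE form, and the temperature below are assumptions based on the abstract, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


class DualLevelMemory:
    """Hypothetical MGRE-style FIFO buffer holding both feature-level and
    logit-level representations to provide global context."""

    def __init__(self, size: int, feat_dim: int, num_classes: int):
        self.size = size
        self.feats = torch.zeros(size, feat_dim)
        self.logits = torch.zeros(size, num_classes)
        self.ptr = 0

    @torch.no_grad()
    def update(self, feats: torch.Tensor, logits: torch.Tensor) -> None:
        # Overwrite the oldest entries in ring-buffer fashion.
        n = feats.shape[0]
        idx = (self.ptr + torch.arange(n)) % self.size
        self.feats[idx] = feats.detach().cpu()
        self.logits[idx] = logits.detach().cpu()
        self.ptr = (self.ptr + n) % self.size


def consistency_weighted_contrastive_loss(
    z1: torch.Tensor,           # (B, D) L2-normalized features, clip/view 1
    z2: torch.Tensor,           # (B, D) L2-normalized features, clip/view 2
    consistency: torch.Tensor,  # (B,) scores in [0, 1], e.g. temporal agreement
    temperature: float = 0.1,
) -> torch.Tensor:
    """InfoNCE over two clips of each video, with per-instance weights."""
    B = z1.shape[0]
    logits = z1 @ z2.t() / temperature           # (B, B) pairwise similarities
    targets = torch.arange(B, device=z1.device)  # positives on the diagonal
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    # Down-weight instances whose multi-perspective predictions disagree.
    return (consistency * per_sample).mean()
```

In this reading, the consistency scores would come from agreement between predictions over multiple temporal views (the abstract's "consistency-guided voting"), and the memory buffer would supply the global context used to refine those scores.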
Related papers
- Preserving Cross-Modal Consistency for CLIP-based Class-Incremental Learning [77.82901519692378]
Class-incremental learning (CIL) enables models to continuously learn new categories from sequential tasks without forgetting previously acquired knowledge.
We propose DMC, a two-stage framework for CLIP-based CIL that decouples the adaptation of the vision encoder and the optimization of textual soft prompts.
Extensive experiments on CIFAR-100, ImageNet-R, CUB-200, and UCF-101 demonstrate that both DMC and DMC-OT achieve state-of-the-art performance.
arXiv Detail & Related papers (2025-11-14T05:36:36Z)
- Multimodal Alignment with Cross-Attentive GRUs for Fine-Grained Video Understanding [0.0]
We propose a framework that fuses video, image, and text using GRU-based sequence encoders and cross-modal attention mechanisms.
Our results demonstrate that the proposed fusion strategy significantly outperforms unimodal baselines.
arXiv Detail & Related papers (2025-07-04T12:35:52Z)
- Happy: A Debiased Learning Framework for Continual Generalized Category Discovery [54.54153155039062]
This paper explores the underexplored task of Continual Generalized Category Discovery (C-GCD).
C-GCD aims to incrementally discover new classes from unlabeled data while maintaining the ability to recognize previously learned classes.
We introduce a debiased learning framework, namely Happy, characterized by Hardness-aware prototype sampling and soft entropy regularization.
arXiv Detail & Related papers (2024-10-09T04:18:51Z)
- PromptCCD: Learning Gaussian Mixture Prompt Pool for Continual Category Discovery [60.960147451219946]
Continual Category Discovery (CCD) aims to automatically discover novel categories in a continuous stream of unlabeled data.
We propose PromptCCD, a framework that utilizes a Gaussian Mixture Model (GMM) as a prompting method for CCD.
We extend the standard evaluation metric for Generalized Category Discovery (GCD) to CCD and benchmark state-of-the-art methods on diverse public datasets.
arXiv Detail & Related papers (2024-07-26T17:59:51Z)
- Textual Knowledge Matters: Cross-Modality Co-Teaching for Generalized Visual Class Discovery [65.16724941038052]
Generalized Category Discovery (GCD) aims to cluster unlabeled data from both known and unknown categories.
Current GCD methods rely only on visual cues, neglecting the multi-modality perceptive nature of human cognitive processes in discovering novel visual categories.
We propose a two-phase TextGCD framework to accomplish multi-modality GCD by exploiting powerful Visual-Language Models.
arXiv Detail & Related papers (2024-03-12T07:06:50Z)
- Dynamic Conceptional Contrastive Learning for Generalized Category Discovery [76.82327473338734]
Generalized category discovery (GCD) aims to automatically cluster partially labeled data.
Unlabeled data contain instances that are not only from known categories of the labeled data but also from novel categories.
One effective approach to GCD is applying self-supervised learning to learn discriminative representations for unlabeled data.
We propose a Dynamic Conceptional Contrastive Learning framework, which can effectively improve clustering accuracy.
arXiv Detail & Related papers (2023-03-30T14:04:39Z)
- Temporal Transductive Inference for Few-Shot Video Object Segmentation [27.140141181513425]
Few-shot video object segmentation (FS-VOS) aims at segmenting video frames using a few labelled examples of classes not seen during initial training.
Key to our approach is the use of both global and local temporal constraints.
Empirically, our model outperforms state-of-the-art meta-learning approaches in terms of mean intersection over union on YouTube-VIS by 2.8%.
arXiv Detail & Related papers (2022-03-27T14:08:30Z)
- vCLIMB: A Novel Video Class Incremental Learning Benchmark [53.90485760679411]
We introduce vCLIMB, a novel video continual learning benchmark.
vCLIMB is a standardized test-bed to analyze catastrophic forgetting of deep models in video continual learning.
We propose a temporal consistency regularization that can be applied on top of memory-based continual learning methods; a sketch of this idea appears after this entry.
arXiv Detail & Related papers (2022-01-23T22:14:17Z)
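The vCLIMB abstract does not specify the form of its temporal consistency regularizer, so the sketch below is only a guess at a common instantiation: penalizing divergence between a model's predictive distributions on temporally adjacent frames of the same video. The function name and the KL form are assumptions, not vCLIMB's actual method.

```python
import torch
import torch.nn.functional as F


def temporal_consistency_reg(
    logits_t: torch.Tensor,   # (B, C) logits for frames at time t
    logits_t1: torch.Tensor,  # (B, C) logits for frames at time t+1
) -> torch.Tensor:
    """Hypothetical regularizer: encourage similar predictive distributions
    on temporally adjacent frames, with a stop-gradient on one side."""
    p = F.log_softmax(logits_t, dim=-1)
    q = F.softmax(logits_t1, dim=-1).detach()
    return F.kl_div(p, q, reduction="batchmean")
```

In a memory-based continual learner, such a term would typically be added to the replay classification loss, e.g. `loss = ce_loss + lam * temporal_consistency_reg(logits_t, logits_t1)`, with `lam` a tunable weight.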