Unsupervised Video Continual Learning via Non-Parametric Deep Embedded Clustering
- URL: http://arxiv.org/abs/2508.21773v1
- Date: Fri, 29 Aug 2025 16:49:03 GMT
- Title: Unsupervised Video Continual Learning via Non-Parametric Deep Embedded Clustering
- Authors: Nattapong Kurpukdee, Adrian G. Bors
- Abstract summary: We propose a realistic scenario for unsupervised video learning where neither task boundaries nor labels are provided when learning a succession of tasks. We also provide a non-parametric learning solution for the under-explored problem of unsupervised video continual learning.
- Score: 47.53991869205973
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a realistic scenario for unsupervised video learning in which neither task boundaries nor labels are provided when learning a succession of tasks. We also provide a non-parametric learning solution for the under-explored problem of unsupervised video continual learning. Videos are a complex and rich spatio-temporal medium, widely used in many applications, but one that has not been sufficiently explored in unsupervised continual learning. Prior studies have focused only on supervised continual learning, relying on knowledge of labels and task boundaries, yet obtaining labeled data is costly and often impractical. To address this gap, we study unsupervised video continual learning (uVCL). uVCL raises additional challenges due to the extra computational and memory requirements of processing videos compared to images. We introduce a general benchmark experimental protocol for uVCL that considers the learning of unstructured video data categories during each task. We propose to use Kernel Density Estimation (KDE) of deep embedded video features, extracted by unsupervised video transformer networks, as a non-parametric probabilistic representation of the data. We introduce a novelty detection criterion for incoming new-task data, dynamically enabling the expansion of memory clusters in order to capture new knowledge when learning a succession of tasks. We leverage transfer learning from previous tasks as the initial state for knowledge transfer to the current task. We found that the proposed methodology substantially enhances the performance of the model when successively learning many tasks. We perform in-depth evaluations on three standard video action recognition datasets, UCF101, HMDB51, and Something-Something V2, without using any labels or class boundaries.
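The pipeline described above, fitting a KDE over deep embedded features and expanding the set of memory clusters when a novelty criterion fires, can be illustrated with a short sketch. The following Python code is a minimal, hypothetical rendering, not the authors' implementation: the `KDEMemory` class, the mean-density novelty rule, and the `novelty_threshold` value are all illustrative assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde

class KDEMemory:
    """Non-parametric memory sketch: one KDE per discovered cluster."""

    def __init__(self, novelty_threshold=1e-4):
        self.clusters = []                    # fitted gaussian_kde models
        self.novelty_threshold = novelty_threshold

    def max_density(self, features):
        """Highest density of each sample under any existing cluster."""
        if not self.clusters:
            return np.zeros(len(features))
        # gaussian_kde expects data as (n_dims, n_samples)
        densities = np.stack([kde(features.T) for kde in self.clusters])
        return densities.max(axis=0)

    def observe_task(self, features):
        """Expand memory with a new cluster if the task data looks novel."""
        if self.max_density(features).mean() < self.novelty_threshold:
            self.clusters.append(gaussian_kde(features.T))

# Usage with placeholder embeddings; a real pipeline would take features
# from an unsupervised video transformer, as the abstract describes.
memory = KDEMemory()
memory.observe_task(np.random.randn(200, 8))        # first task is novel
memory.observe_task(np.random.randn(200, 8) + 5.0)  # shifted data: new cluster
print(len(memory.clusters))
```

In the paper's setting, the novelty criterion would decide, without labels or task boundaries, whether incoming task data warrants a new memory cluster or is already covered by existing ones.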
Related papers
- Unsupervised Video Class-Incremental Learning via Deep Embedded Clustering Management [47.53991869205973]
Unsupervised video class incremental learning (uVCIL) represents an important learning paradigm for learning video information without forgetting. We propose a simple yet effective approach to address uVCIL. We first consider a deep feature extractor network, providing a set of representative video features during each task without assuming any class or task information.
arXiv Detail & Related papers (2026-01-20T15:25:41Z)
- Dual Learning with Dynamic Knowledge Distillation and Soft Alignment for Partially Relevant Video Retrieval [53.54695034420311]
In practice, videos are typically untrimmed, of long duration, and contain much more complicated background content. We propose a novel framework that distills generalization knowledge from a powerful large-scale vision-language pre-trained model. Experimental results demonstrate that our proposed model achieves state-of-the-art performance on the TVR, ActivityNet, and Charades-STA datasets.
arXiv Detail & Related papers (2025-10-14T08:38:20Z)
- LAVID: An Agentic LVLM Framework for Diffusion-Generated Video Detection [14.687867348598035]
Large Vision Language Models (LVLMs) have become an emerging tool for AI-generated content detection. We propose LAVID, a novel LVLM-based AI-generated video detection framework with explicit knowledge enhancement. Our proposed pipeline automatically selects a set of explicit knowledge tools for detection, and then adaptively adjusts the structure prompt by self-rewriting.
arXiv Detail & Related papers (2025-02-20T19:34:58Z)
- Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation [57.34255010956452]
This work revisits scaling with synthetic data and focuses on developing video-LLMs from a data-centric perspective. We propose a data augmentation method called Sparrow, which synthesizes video-like samples from pure text instruction data. Our proposed method achieves performance comparable to or even superior to that of baselines trained with significantly more samples.
arXiv Detail & Related papers (2024-11-29T18:59:54Z)
- COOLer: Class-Incremental Learning for Appearance-Based Multiple Object Tracking [32.47215340215641]
This paper extends the scope of continual learning research to class-incremental learning for multiple object tracking (MOT).
Previous solutions for continual learning of object detectors do not address the data association stage of appearance-based trackers.
We introduce COOLer, a COntrastive- and cOntinual-Learning-based tracker, which incrementally learns to track new categories while preserving past knowledge.
arXiv Detail & Related papers (2023-10-04T17:49:48Z)
- Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language [38.02396786726476]
We propose to learn multi-modal representations from audio-visual data using cross-modal attention.
In our generalised audio-visual zero-shot learning setting, we include all the training classes in the test-time search space.
Due to the lack of a unified benchmark in this domain, we introduce a (generalised) zero-shot learning benchmark on three audio-visual datasets.
arXiv Detail & Related papers (2022-03-07T18:52:13Z)
- vCLIMB: A Novel Video Class Incremental Learning Benchmark [53.90485760679411]
We introduce vCLIMB, a novel video continual learning benchmark.
vCLIMB is a standardized test-bed to analyze catastrophic forgetting of deep models in video continual learning.
We propose a temporal consistency regularization that can be applied on top of memory-based continual learning methods (an illustrative sketch follows this entry).
arXiv Detail & Related papers (2022-01-23T22:14:17Z)
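The temporal consistency regularization mentioned in the vCLIMB entry above lends itself to a brief illustration. The sketch below is a generic stand-in, not vCLIMB's actual loss: it simply penalizes drift between consecutive per-frame predictions of a clip, and the tensor shapes and the 0.1 weight are assumptions.

```python
import torch

def temporal_consistency_loss(frame_logits: torch.Tensor) -> torch.Tensor:
    """Penalize drift between consecutive per-frame predictions.

    frame_logits: (batch, time, num_classes) logits for the frames of a clip.
    """
    diffs = frame_logits[:, 1:, :] - frame_logits[:, :-1, :]
    return diffs.pow(2).mean()

# Usage inside a replay training step; the weight 0.1 is arbitrary.
logits = torch.randn(4, 8, 51, requires_grad=True)  # e.g. 51 HMDB51 classes
task_loss = logits.mean()                           # placeholder task loss
loss = task_loss + 0.1 * temporal_consistency_loss(logits)
loss.backward()
```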
- VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation [124.02278735049235]
The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future study of advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z)
- Anomaly Detection in Video via Self-Supervised and Multi-Task Learning [113.81927544121625]
Anomaly detection in video is a challenging computer vision problem.
In this paper, we approach anomalous event detection in video through self-supervised and multi-task learning at the object level.
arXiv Detail & Related papers (2020-11-15T10:21:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences arising from its use.