Just a Glimpse: Rethinking Temporal Information for Video Continual
Learning
- URL: http://arxiv.org/abs/2305.18418v2
- Date: Wed, 28 Jun 2023 12:34:22 GMT
- Title: Just a Glimpse: Rethinking Temporal Information for Video Continual
Learning
- Authors: Lama Alssum, Juan Leon Alcazar, Merey Ramazanova, Chen Zhao, Bernard
Ghanem
- Abstract summary: We propose a novel replay mechanism for effective video continual learning based on single frames.
Under extreme memory constraints, video diversity plays a more significant role than temporal information.
Our method achieves state-of-the-art performance, outperforming the previous state-of-the-art by up to 21.49%.
- Score: 58.7097258722291
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Class-incremental learning is one of the most important settings for the
study of Continual Learning, as it closely resembles real-world application
scenarios. With constrained memory sizes, catastrophic forgetting arises as the
number of classes/tasks increases. Studying continual learning in the video
domain poses even more challenges, as video data contains a large number of
frames, which places a higher burden on the replay memory. The current common
practice is to sub-sample frames from the video stream and store them in the
replay memory. In this paper, we propose SMILE, a novel replay mechanism for
effective video continual learning based on single frames. Through
extensive experimentation, we show that under extreme memory constraints, video
diversity plays a more significant role than temporal information. Therefore,
our method focuses on learning from a small number of frames that represent a
large number of unique videos. On three representative video datasets,
Kinetics, UCF101, and ActivityNet, the proposed method achieves
state-of-the-art performance, outperforming the previous state-of-the-art by up
to 21.49%.
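
The core idea, trading temporal depth for video diversity under a fixed memory budget, can be illustrated with a minimal sketch. The buffer below is a hypothetical illustration, not the authors' released SMILE code; the class name, the random-frame selection policy, and the random eviction policy are all assumptions made for the example.

```python
import random

class SingleFrameReplayBuffer:
    """Hypothetical sketch of a diversity-first replay memory (not the
    authors' SMILE implementation): store one frame per video so a fixed
    budget covers as many unique videos as possible."""

    def __init__(self, capacity):
        self.capacity = capacity  # max number of stored frames (one per video)
        self.memory = {}          # video_id -> (frame, label)

    def add_video(self, video_id, frames, label):
        # Keep a single representative frame per video; random selection is
        # one simple policy under extreme memory constraints.
        frame = random.choice(frames)
        if video_id not in self.memory and len(self.memory) >= self.capacity:
            # Evict a stored video uniformly at random to respect the budget.
            del self.memory[random.choice(list(self.memory))]
        self.memory[video_id] = (frame, label)

    def sample(self, batch_size):
        # Draw a replay batch of single frames, each from a distinct video.
        ids = random.sample(list(self.memory), min(batch_size, len(self.memory)))
        return [self.memory[i] for i in ids]
```

Under the same budget of, say, 1000 frames, a clip-based memory storing 8 frames per video covers only 125 unique videos, while a single-frame buffer covers 1000; this diversity-versus-temporal-depth trade-off is the paper's central claim.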
Related papers
- ARVideo: Autoregressive Pretraining for Self-Supervised Video Representation Learning [29.620990627792906]
This paper presents a new self-supervised video representation learning framework, ARVideo, which autoregressively predicts the next video token in a tailored sequence order.
Extensive experiments establish ARVideo as an effective paradigm for self-supervised video representation learning.
arXiv Detail & Related papers (2024-05-24T02:29:03Z)
- Learning Transferable Spatiotemporal Representations from Natural Script Knowledge [65.40899722211726]
We introduce a new pretext task, Turning to Video for Transcript Sorting (TVTS), which sorts ASR transcripts by attending to learned video representations.
This enables the model to contextualize what is happening, much as humans do, and to apply seamlessly to large-scale uncurated real-world video data.
arXiv Detail & Related papers (2022-09-30T07:39:48Z)
- Self-Supervised Video Representation Learning with Motion-Contrastive Perception [13.860736711747284]
We propose the Motion-Contrastive Perception Network (MCPNet), which consists of two branches: Motion Information Perception (MIP) and Contrastive Instance Perception (CIP).
Our method outperforms current state-of-the-art visual-only self-supervised approaches.
arXiv Detail & Related papers (2022-04-10T05:34:46Z)
- vCLIMB: A Novel Video Class Incremental Learning Benchmark [53.90485760679411]
We introduce vCLIMB, a novel video continual learning benchmark.
vCLIMB is a standardized test-bed to analyze catastrophic forgetting of deep models in video continual learning.
We propose a temporal consistency regularization that can be applied on top of memory-based continual learning methods (a rough sketch appears after this list).
arXiv Detail & Related papers (2022-01-23T22:14:17Z)
- Beyond Short Clips: End-to-End Video-Level Learning with Collaborative Memories [56.91664227337115]
We introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration.
This enables the learning of long-range dependencies beyond a single clip.
Our proposed framework is end-to-end trainable and significantly improves video classification accuracy with negligible computational overhead.
arXiv Detail & Related papers (2021-04-02T18:59:09Z)
- Temporal Context Aggregation for Video Retrieval with Contrastive Learning [81.12514007044456]
We propose TCA, a video representation learning framework that incorporates long-range temporal information between frame-level features.
The proposed method shows a significant performance advantage (17% mAP on FIVR-200K) over state-of-the-art methods with video-level features.
arXiv Detail & Related papers (2020-08-04T05:24:20Z)
- Memory-augmented Dense Predictive Coding for Video Representation Learning [103.69904379356413]
We propose a new architecture and learning framework, Memory-augmented Dense Predictive Coding (MemDPC), for this task.
We investigate visual-only self-supervised video representation learning from RGB frames, from unsupervised optical flow, or from both.
In all cases, we demonstrate state-of-the-art or comparable performance to other approaches while using orders of magnitude less training data.
arXiv Detail & Related papers (2020-08-03T17:57:01Z)
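
The temporal consistency regularization mentioned in the vCLIMB entry above can, in rough form, be written as a penalty between embeddings of frames sampled from the same stored video. The sketch below is an assumed formulation for illustration only (the function name, the use of an MSE penalty, and the reliance on PyTorch are all assumptions, not the benchmark's actual loss):

```python
import torch
import torch.nn.functional as F

def temporal_consistency_loss(model: torch.nn.Module,
                              frames_a: torch.Tensor,
                              frames_b: torch.Tensor) -> torch.Tensor:
    """Rough sketch (assumed form, not vCLIMB's exact loss): penalize
    disagreement between embeddings of two frames drawn from the same
    stored video, encouraging temporally consistent representations."""
    emb_a = model(frames_a)  # (batch, dim) embeddings
    emb_b = model(frames_b)  # (batch, dim) embeddings
    return F.mse_loss(emb_a, emb_b)
```

A term of this kind would typically be added, with a weighting coefficient, to the standard replay-based classification loss.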