Memory Storyboard: Leveraging Temporal Segmentation for Streaming Self-Supervised Learning from Egocentric Videos
- URL: http://arxiv.org/abs/2501.12254v1
- Date: Tue, 21 Jan 2025 16:19:38 GMT
- Title: Memory Storyboard: Leveraging Temporal Segmentation for Streaming Self-Supervised Learning from Egocentric Videos
- Authors: Yanlai Yang, Mengye Ren
- Abstract summary: We investigate streaming self-supervised learning from long-form real-world egocentric video streams.
Inspired by the event segmentation mechanism in human perception and memory, we propose "Memory Storyboard", which groups recent past frames into temporal segments for memory replay.
To accommodate efficient temporal segmentation, we propose a two-tier memory hierarchy.
- Abstract: Self-supervised learning holds the promise to learn good representations from real-world continuous uncurated data streams. However, most existing works in visual self-supervised learning focus on static images or artificial data streams. Towards exploring a more realistic learning substrate, we investigate streaming self-supervised learning from long-form real-world egocentric video streams. Inspired by the event segmentation mechanism in human perception and memory, we propose "Memory Storyboard" that groups recent past frames into temporal segments for more effective summarization of the past visual streams for memory replay. To accommodate efficient temporal segmentation, we propose a two-tier memory hierarchy: the recent past is stored in a short-term memory, and the storyboard temporal segments are then transferred to a long-term memory. Experiments on real-world egocentric video datasets including SAYCam and KrishnaCam show that contrastive learning objectives on top of storyboard frames result in semantically meaningful representations which outperform those produced by state-of-the-art unsupervised continual learning methods.
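The abstract describes the method's core loop: buffer recent frames in a short-term memory, segment them into temporal events, summarize each segment onto a "storyboard" held in long-term memory, and replay from it for a contrastive objective. Below is a minimal Python sketch of how such a two-tier hierarchy could be organized; the class and parameter names, the distance-based boundary heuristic, and the mean-frame summarization are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a two-tier replay memory with temporal segmentation.
# All names (MemoryStoryboardSketch, segment_threshold, ...) are illustrative
# assumptions based on the abstract, not the paper's actual code.
from collections import deque

import numpy as np


class MemoryStoryboardSketch:
    """Short-term FIFO buffer of recent frames, segmented into temporal
    events and summarized into a long-term 'storyboard' memory."""

    def __init__(self, stm_size=256, ltm_size=2048, segment_threshold=0.5):
        self.short_term = deque(maxlen=stm_size)    # recent raw frame embeddings
        self.long_term = []                         # storyboard segment summaries
        self.ltm_size = ltm_size
        self.segment_threshold = segment_threshold  # event-boundary cutoff

    def observe(self, frame_embedding):
        """Push one embedded frame from the stream into short-term memory."""
        self.short_term.append(frame_embedding)
        if len(self.short_term) == self.short_term.maxlen:
            self._consolidate()

    def _segment(self, frames):
        """Split frames at large feature jumps (a crude event-boundary proxy)."""
        segments, current = [], [frames[0]]
        for prev, cur in zip(frames, frames[1:]):
            if np.linalg.norm(cur - prev) > self.segment_threshold:
                segments.append(current)            # boundary: close the segment
                current = []
            current.append(cur)
        segments.append(current)
        return segments

    def _consolidate(self):
        """Summarize each segment (here: its mean frame) into long-term memory."""
        for seg in self._segment(list(self.short_term)):
            self.long_term.append(np.mean(seg, axis=0))
        self.short_term.clear()
        self.long_term = self.long_term[-self.ltm_size:]  # fixed memory budget

    def sample_replay(self, batch_size, rng=np.random):
        """Draw a replay batch of segment summaries for a contrastive objective."""
        idx = rng.choice(len(self.long_term),
                         size=min(batch_size, len(self.long_term)),
                         replace=False)
        return [self.long_term[i] for i in idx]
```

The fixed `ltm_size` keeps the replay budget constant, the usual constraint in streaming settings; the paper's actual segmentation and summarization rules may differ from this heuristic.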
Related papers
- HiCM$^2$: Hierarchical Compact Memory Modeling for Dense Video Captioning [9.899703354116962]
Interest in dense video captioning (DVC) has been on the rise.
Several studies highlight the challenges of utilizing prior knowledge, such as pre-training and external memory.
We propose a model that leverages the prior knowledge of human-oriented hierarchical compact memory.
arXiv Detail & Related papers (2024-12-19T07:06:25Z)
- StreamMOS: Streaming Moving Object Segmentation with Multi-View Perception and Dual-Span Memory [21.300636683882338]
We propose a streaming network with a memory mechanism, called StreamMOS, to build the association of features and predictions among multiple inferences.
Specifically, we utilize a short-term memory to convey historical features, which can be regarded as a spatial prior for moving objects.
We also present a multi-view encoder with projection and asymmetric convolution to extract motion features of objects across different representations.
arXiv Detail & Related papers (2024-07-25T09:51:09Z)
- Dancing with Still Images: Video Distillation via Static-Dynamic Disentanglement [56.26688591324508]
We provide the first systematic study of video distillation and introduce a taxonomy to categorize temporal compression.
Our investigation reveals that the temporal information is usually not well learned during distillation, and the temporal dimension of synthetic data contributes little.
Our method achieves state-of-the-art results on video datasets at different scales, with a notably smaller memory storage budget.
arXiv Detail & Related papers (2023-12-01T05:59:08Z)
- Saliency-Guided Hidden Associative Replay for Continual Learning [13.551181595881326]
Continual Learning is a burgeoning domain in next-generation AI, focusing on training neural networks over a sequence of tasks akin to human learning.
This paper presents Saliency-Guided Hidden Associative Replay for Continual Learning (SHARC).
This novel framework synergizes associative memory with replay-based strategies; SHARC primarily archives salient data segments via sparse memory encoding.
arXiv Detail & Related papers (2023-10-06T15:54:12Z)
- Black-box Unsupervised Domain Adaptation with Bi-directional Atkinson-Shiffrin Memory [59.51934126717572]
Black-box unsupervised domain adaptation (UDA) learns with source predictions of target data without accessing either source data or source models during training.
We propose BiMem, a bi-directional memorization mechanism that learns to remember useful and representative information to correct noisy pseudo labels on the fly.
BiMem achieves superior domain adaptation performance consistently across various visual recognition tasks such as image classification, semantic segmentation and object detection.
arXiv Detail & Related papers (2023-08-25T08:06:48Z)
- Just a Glimpse: Rethinking Temporal Information for Video Continual Learning [58.7097258722291]
We propose a novel replay mechanism for effective video continual learning based on individual/single frames.
Under extreme memory constraints, video diversity plays a more significant role than temporal information.
Our method achieves state-of-the-art performance, outperforming the previous state-of-the-art by up to 21.49%.
arXiv Detail & Related papers (2023-05-28T19:14:25Z)
- Saliency-Augmented Memory Completion for Continual Learning [8.243137410556495]
How to forget is a problem continual learning must address.
Our paper proposes a new saliency-augmented memory completion framework for continual learning.
arXiv Detail & Related papers (2022-12-26T18:06:39Z)
- Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) aims to regulate the intermediate representation consecutively, producing a representation that emphasizes the novel information in the frame at the current timestamp.
SRL sharply outperforms the existing state of the art in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z)
- Memory-augmented Dense Predictive Coding for Video Representation Learning [103.69904379356413]
We propose a new architecture and learning framework, Memory-augmented Dense Predictive Coding (MemDPC), for the task.
We investigate visual-only self-supervised video representation learning from RGB frames, from unsupervised optical flow, or from both.
In all cases, we demonstrate state-of-the-art or comparable performance over other approaches with orders of magnitude less training data.
arXiv Detail & Related papers (2020-08-03T17:57:01Z)
- Video Object Segmentation with Episodic Graph Memory Networks [198.74780033475724]
A graph memory network is developed to address the novel idea of "learning to update the segmentation model".
We exploit an episodic memory network, organized as a fully connected graph, to store frames as nodes and capture cross-frame correlations by edges.
The proposed graph memory network yields a neat yet principled framework that generalizes well to both one-shot and zero-shot video object segmentation tasks.
arXiv Detail & Related papers (2020-07-14T13:19:19Z)