Related papers: CRAM: Large-scale Video Continual Learning with Bootstrapped Compression

CRAM: Large-scale Video Continual Learning with Bootstrapped Compression

URL: http://arxiv.org/abs/2508.05001v1
Date: Thu, 07 Aug 2025 03:32:20 GMT
Title: CRAM: Large-scale Video Continual Learning with Bootstrapped Compression
Authors: Shivani Mall, Joao F. Henriques,
Abstract summary: Continual learning (CL) promises to allow neural networks to learn from continuous streams of inputs, instead of IID (independent and identically distributed) sampling.<n>We focus on video CL employing a rehearsal-based approach, which reinforces past samples from a memory buffer.<n>We propose a scheme to deal with this forgetting by refreshing video codes, which requires careful decompression with a previous version of the network and recompression with a new one.
Score: 4.304743423080129
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Continual learning (CL) promises to allow neural networks to learn from continuous streams of inputs, instead of IID (independent and identically distributed) sampling, which requires random access to a full dataset. This would allow for much smaller storage requirements and self-sufficiency of deployed systems that cope with natural distribution shifts, similarly to biological learning. We focus on video CL employing a rehearsal-based approach, which reinforces past samples from a memory buffer. We posit that part of the reason why practical video CL is challenging is the high memory requirements of video, further exacerbated by long-videos and continual streams, which are at odds with the common rehearsal-buffer size constraints. To address this, we propose to use compressed vision, i.e. store video codes (embeddings) instead of raw inputs, and train a video classifier by IID sampling from this rolling buffer. Training a video compressor online (so not depending on any pre-trained networks) means that it is also subject to catastrophic forgetting. We propose a scheme to deal with this forgetting by refreshing video codes, which requires careful decompression with a previous version of the network and recompression with a new one. We name our method Continually Refreshed Amodal Memory (CRAM). We expand current video CL benchmarks to large-scale settings, namely EpicKitchens-100 and Kinetics-700, storing thousands of relatively long videos in under 2 GB, and demonstrate empirically that our video CL method outperforms prior art with a significantly reduced memory footprint.

Related papers

Long-VMNet: Accelerating Long-Form Video Understanding via Fixed Memory [5.311777874655448]
Long-Video Memory Network, Long-VMNet, is a novel video understanding method.<n>Long-VMNet achieves improved efficiency by leveraging a neural sampler that identifies discriminative tokens.<n>Our results on the Rest-ADL dataset demonstrate an 18x -- 75x improvement in inference times for long-form video retrieval and answering questions.
arXiv Detail & Related papers (2025-03-17T20:25:41Z)
ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding [55.320254859515714]
ReTaKe enables VideoLLMs to process 8 times longer frames (up to 2048), similar-sized models by 3-5% and even rivaling much larger ones on VideoMME, MLVU, LongVideoBench, and LVBench.<n>Our code is available at https://github.com/SCZwangxiao/video-ReTaKe.
arXiv Detail & Related papers (2024-12-29T15:42:24Z)
Hierarchical Memory for Long Video QA [78.72965584414368]
This paper describes our champion solution to the LOVEU Challenge @ CVPR'24, Track 1 (Long Video VQA)<n>We adopt a hierarchical memory mechanism named STAR Memory, that is capable of processing long videos with limited GPU memory (VRAM)<n>We further utilize the video and audio data of MovieChat-1K training set to fine-tune the pretrained weight released by Flash-VStream, achieving 1st place in the challenge.
arXiv Detail & Related papers (2024-06-30T06:08:12Z)
Streaming Long Video Understanding with Large Language Models [83.11094441893435]
VideoStreaming is an advanced vision-language large model (VLLM) for video understanding. It capably understands arbitrary-length video with a constant number of video streaming tokens encoded and propagatedly selected. Our model achieves superior performance and higher efficiency on long video benchmarks.
arXiv Detail & Related papers (2024-05-25T02:22:09Z)
Just a Glimpse: Rethinking Temporal Information for Video Continual Learning [58.7097258722291]
We propose a novel replay mechanism for effective video continual learning based on individual/single frames. Under extreme memory constraints, video diversity plays a more significant role than temporal information. Our method achieves state-of-the-art performance, outperforming the previous state-of-the-art by up to 21.49%.
arXiv Detail & Related papers (2023-05-28T19:14:25Z)
READMem: Robust Embedding Association for a Diverse Memory in Unconstrained Video Object Segmentation [24.813416082160224]
We present READMem, a modular framework for sVOS methods to handle unconstrained videos. We propose a robust association of the embeddings stored in the memory with query embeddings during the update process. Our approach achieves competitive results on the Long-time Video dataset (LV1) while not hindering performance on short sequences.
arXiv Detail & Related papers (2023-05-22T08:31:16Z)
Compressed Vision for Efficient Video Understanding [83.97689018324732]
We propose a framework enabling research on hour-long videos with the same hardware that can now process second-long videos. We replace standard video compression, e.g. JPEG, with neural compression and show that we can directly feed compressed videos as inputs to regular video networks.
arXiv Detail & Related papers (2022-10-06T15:35:49Z)
Recurrent Dynamic Embedding for Video Object Segmentation [54.52527157232795]
We propose a Recurrent Dynamic Embedding (RDE) to build a memory bank of constant size. We propose an unbiased guidance loss during the training stage, which makes SAM more robust in long videos. We also design a novel self-correction strategy so that the network can repair the embeddings of masks with different qualities in the memory bank.
arXiv Detail & Related papers (2022-05-08T02:24:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.