ESSENTIAL: Episodic and Semantic Memory Integration for Video Class-Incremental Learning
- URL: http://arxiv.org/abs/2508.10896v1
- Date: Thu, 14 Aug 2025 17:59:07 GMT
- Title: ESSENTIAL: Episodic and Semantic Memory Integration for Video Class-Incremental Learning
- Authors: Jongseo Lee, Kyungho Bae, Kyle Min, Gyeong-Moon Park, Jinwoo Choi,
- Abstract summary: ESSENTIAL consists of episodic memory for storing temporally sparse features and semantic memory for storing general knowledge represented by learnable prompts.<n>We introduce a novel memory retrieval (MR) module that integrates episodic memory and semantic prompts through cross-attention.<n>Remarkably, with significantly reduced memory, ESSENTIAL achieves favorable performance on the benchmarks.
- Score: 8.958316945958025
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we tackle the problem of video classincremental learning (VCIL). Many existing VCIL methods mitigate catastrophic forgetting by rehearsal training with a few temporally dense samples stored in episodic memory, which is memory-inefficient. Alternatively, some methods store temporally sparse samples, sacrificing essential temporal information and thereby resulting in inferior performance. To address this trade-off between memory-efficiency and performance, we propose EpiSodic and SEmaNTIc memory integrAtion for video class-incremental Learning (ESSENTIAL). ESSENTIAL consists of episodic memory for storing temporally sparse features and semantic memory for storing general knowledge represented by learnable prompts. We introduce a novel memory retrieval (MR) module that integrates episodic memory and semantic prompts through cross-attention, enabling the retrieval of temporally dense features from temporally sparse features. We rigorously validate ESSENTIAL on diverse datasets: UCF-101, HMDB51, and Something-Something-V2 from the TCD benchmark and UCF-101, ActivityNet, and Kinetics-400 from the vCLIMB benchmark. Remarkably, with significantly reduced memory, ESSENTIAL achieves favorable performance on the benchmarks.
Related papers
- From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents [78.30630000529133]
We propose MM-Mem, a pyramidal multimodal memory architecture grounded in Fuzzy-Trace Theory.<n> MM-Mem memory structures hierarchically into a Sensory Buffer, Episodic Stream, and Symbolic.<n>Experiments confirm the effectiveness of MM-Mem on both offline and streaming tasks.
arXiv Detail & Related papers (2026-03-02T05:12:45Z) - WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning [66.24870234484668]
We introduce WorldMM, a novel multimodal memory agent that constructs and retrieves from multiple complementary memories.<n>WorldMM significantly outperforms existing baselines across five long video question-answering benchmarks.
arXiv Detail & Related papers (2025-12-02T05:14:52Z) - Evaluating Long-Term Memory for Long-Context Question Answering [100.1267054069757]
We present a systematic evaluation of memory-augmented methods using LoCoMo, a benchmark of synthetic long-context dialogues annotated for question-answering tasks.<n>Our findings show that memory-augmented approaches reduce token usage by over 90% while maintaining competitive accuracy.
arXiv Detail & Related papers (2025-10-27T18:03:50Z) - LightMem: Lightweight and Efficient Memory-Augmented Generation [72.21680105265824]
We introduce a new memory system called LightMem, which strikes a balance between the performance and efficiency of memory systems.<n>Inspired by the Atkinson-Shiffrin model of human memory, LightMem organizes memory into three complementary stages.<n>Experiments on LongMemEval with GPT and Qwen backbones show that LightMem outperforms strong baselines in accuracy (up to 10.9% gains) while reducing token usage by up to 117x.
arXiv Detail & Related papers (2025-10-21T17:58:17Z) - Multiple Memory Systems for Enhancing the Long-term Memory of Agent [9.43633399280987]
Existing methods, such as MemoryBank and A-MEM, have poor quality of stored memory content.<n>We have designed a multiple memory system inspired by cognitive psychology theory.
arXiv Detail & Related papers (2025-08-21T06:29:42Z) - ReWind: Understanding Long Videos with Instructed Learnable Memory [8.002949551539297]
Vision-Language Models (VLMs) are crucial for applications requiring integrated understanding textual and visual information.<n>We introduce ReWind, a novel memory-based VLM designed for efficient long video understanding while preserving temporal fidelity.<n>We empirically demonstrate ReWind's superior performance in visual question answering (VQA) and temporal grounding tasks, surpassing previous methods on long video benchmarks.
arXiv Detail & Related papers (2024-11-23T13:23:22Z) - Stable Hadamard Memory: Revitalizing Memory-Augmented Agents for Reinforcement Learning [64.93848182403116]
Current deep-learning memory models struggle in reinforcement learning environments that are partially observable and long-term.
We introduce the Stable Hadamard Memory, a novel memory model for reinforcement learning agents.
Our approach significantly outperforms state-of-the-art memory-based methods on challenging partially observable benchmarks.
arXiv Detail & Related papers (2024-10-14T03:50:17Z) - MEMO: Fine-grained Tensor Management For Ultra-long Context LLM Training [24.066283519769968]
Large Language Models (LLMs) have been trained using extended context lengths to foster more creative applications.<n>We propose MEMO, a novel framework for fine-grained activation memory management.<n>MeMO achieves an average of 1.97x and 1.80x MFU compared to Megatron-LM and DeepSpeed.
arXiv Detail & Related papers (2024-07-16T18:59:49Z) - Learning Quality-aware Dynamic Memory for Video Object Segmentation [32.06309833058726]
We propose a Quality-aware Dynamic Memory Network (QDMN) to evaluate the segmentation quality of each frame.
Our QDMN achieves new state-of-the-art performance on both DAVIS and YouTube-VOS benchmarks.
arXiv Detail & Related papers (2022-07-16T12:18:04Z) - XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin
Memory Model [137.50614198301733]
We present XMem, a video object segmentation architecture for long videos with unified feature memory stores.
We develop an architecture that incorporates multiple independent yet deeply-connected feature memory stores.
XMem greatly exceeds state-of-the-art performance on long-video datasets.
arXiv Detail & Related papers (2022-07-14T17:59:37Z) - A Model or 603 Exemplars: Towards Memory-Efficient Class-Incremental
Learning [56.450090618578]
Class-Incremental Learning (CIL) aims to train a model with limited memory size to meet this requirement.
We show that when counting the model size into the total budget and comparing methods with aligned memory size, saving models do not consistently work.
We propose a simple yet effective baseline, denoted as MEMO for Memory-efficient Expandable MOdel.
arXiv Detail & Related papers (2022-05-26T08:24:01Z) - Recurrent Dynamic Embedding for Video Object Segmentation [54.52527157232795]
We propose a Recurrent Dynamic Embedding (RDE) to build a memory bank of constant size.
We propose an unbiased guidance loss during the training stage, which makes SAM more robust in long videos.
We also design a novel self-correction strategy so that the network can repair the embeddings of masks with different qualities in the memory bank.
arXiv Detail & Related papers (2022-05-08T02:24:43Z) - Memory-Guided Semantic Learning Network for Temporal Sentence Grounding [55.31041933103645]
We propose a memory-augmented network that learns and memorizes the rarely appeared content in TSG tasks.
MGSL-Net consists of three main parts: a cross-modal inter-action module, a memory augmentation module, and a heterogeneous attention module.
arXiv Detail & Related papers (2022-01-03T02:32:06Z) - Sequential memory improves sample and memory efficiency in Episodic Control [0.0]
State of the art deep reinforcement learning algorithms are sample inefficient due to the large number of episodes they require to achieve performance.
ERL algorithms, inspired by the mammalian hippocampus, typically use extended memory systems to bootstrap learning from past events to overcome this sample-inefficiency problem.
Here, we demonstrate that including a bias in the acquired memory content derived from the order of episodic sampling improves both the sample and memory efficiency of an episodic control algorithm.
arXiv Detail & Related papers (2021-12-29T18:42:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.