FreshMem: Brain-Inspired Frequency-Space Hybrid Memory for Streaming Video Understanding
- URL: http://arxiv.org/abs/2602.01683v1
- Date: Mon, 02 Feb 2026 05:52:11 GMT
- Title: FreshMem: Brain-Inspired Frequency-Space Hybrid Memory for Streaming Video Understanding
- Authors: Kangcong Li, Peng Ye, Lin Zhang, Chao Wang, Huafeng Qin, Tao Chen,
- Abstract summary: We propose FreshMem, a Frequency-Space Hybrid Memory network inspired by the brain's logarithmic perception and memory consolidation. FreshMem reconciles short-term fidelity with long-term coherence through two synergistic modules. Experiments show that FreshMem significantly boosts the Qwen2-VL baseline, yielding gains of 5.20%, 4.52%, and 2.34% on StreamingBench, OV-Bench, and OVO-Bench, respectively.
- Score: 16.693006630166316
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transitioning Multimodal Large Language Models (MLLMs) from offline to online streaming video understanding is essential for continuous perception. However, existing methods lack flexible adaptivity, leading to irreversible detail loss and context fragmentation. To resolve this, we propose FreshMem, a Frequency-Space Hybrid Memory network inspired by the brain's logarithmic perception and memory consolidation. FreshMem reconciles short-term fidelity with long-term coherence through two synergistic modules: Multi-scale Frequency Memory (MFM), which projects overflowing frames into representative frequency coefficients, complemented by residual details to reconstruct a global historical "gist"; and Space Thumbnail Memory (STM), which discretizes the continuous stream into episodic clusters and employs an adaptive compression strategy to distill them into high-density space thumbnails. Extensive experiments show that FreshMem significantly boosts the Qwen2-VL baseline, yielding gains of 5.20%, 4.52%, and 2.34% on StreamingBench, OV-Bench, and OVO-Bench, respectively. As a training-free solution, FreshMem outperforms several fully fine-tuned methods, offering a highly efficient paradigm for long-horizon streaming video understanding.
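The sketch below illustrates the general frequency-space hybrid memory idea described in the abstract: evicted frames are consolidated into a few frequency coefficients plus a coarse residual (the MFM-style "gist"), while the ongoing stream is segmented into episodic clusters that are each compressed into a compact thumbnail (the STM-style store). The class names, the DCT-based consolidation, the mean-pooled residual and thumbnail, and the cosine-similarity boundary test are all illustrative assumptions; the paper's actual MFM and STM modules operate on MLLM visual tokens and may differ substantially.

```python
# Minimal sketch of a frequency-space hybrid memory in the spirit of FreshMem.
# All names and heuristics here are assumptions for illustration only.
import numpy as np
from scipy.fft import dct, idct

class FrequencyMemory:
    """Keeps a low-frequency 'gist' of overflowing frame features plus a residual."""
    def __init__(self, n_coeffs: int = 8):
        self.n_coeffs = n_coeffs
        self.gist = []  # list of (coefficients, residual) per evicted chunk

    def consolidate(self, frames: np.ndarray):
        # frames: (T, D) frame features about to be evicted from the short-term buffer.
        spectrum = dct(frames, axis=0, norm="ortho")          # project along time
        low = spectrum[: self.n_coeffs]                       # representative coefficients
        padded = np.pad(low, ((0, frames.shape[0] - self.n_coeffs), (0, 0)))
        recon = idct(padded, axis=0, norm="ortho")            # low-frequency reconstruction
        residual = (frames - recon).mean(axis=0)              # coarse residual detail
        self.gist.append((low, residual))

class SpaceThumbnailMemory:
    """Groups the stream into episodic clusters and stores one compact thumbnail each."""
    def __init__(self, sim_threshold: float = 0.9):
        self.sim_threshold = sim_threshold
        self.thumbnails = []
        self._current = []

    def add(self, frame: np.ndarray):
        if self._current:
            last = self._current[-1]
            sim = frame @ last / (np.linalg.norm(frame) * np.linalg.norm(last) + 1e-8)
            if sim < self.sim_threshold:                      # episode boundary detected
                self.flush()
        self._current.append(frame)

    def flush(self):
        if self._current:
            # Adaptive-compression stand-in: mean-pool the episode into one thumbnail.
            self.thumbnails.append(np.stack(self._current).mean(axis=0))
            self._current = []

# Usage on synthetic frame features.
rng = np.random.default_rng(0)
mfm, stm = FrequencyMemory(), SpaceThumbnailMemory()
stream = rng.normal(size=(64, 256)).astype(np.float32)
mfm.consolidate(stream[:32])       # frames overflowing a short-term buffer
for f in stream[32:]:
    stm.add(f)                     # streamed frames grouped into episodes
stm.flush()                        # finalize the last episode
```

In this toy version the long-term state is a handful of DCT coefficients, one residual vector per evicted chunk, and one thumbnail per episode, so memory grows with the number of episodes rather than the number of frames; the paper's adaptive compression presumably tunes this budget rather than using a fixed mean-pool.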
Related papers
- From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents [78.30630000529133]
We propose MM-Mem, a pyramidal multimodal memory architecture grounded in Fuzzy-Trace Theory. MM-Mem structures memory hierarchically into a Sensory Buffer, an Episodic Stream, and a Symbolic memory. Experiments confirm the effectiveness of MM-Mem on both offline and streaming tasks.
arXiv Detail & Related papers (2026-03-02T05:12:45Z) - See More, Store Less: Memory-Efficient Resolution for Video Moment Retrieval [5.835635134105812]
We propose SMORE (See MORE, store less), a framework that enhances memory efficiency while maintaining high information resolution. SMORE (1) uses query-guided captions to encode semantics aligned with user intent, (2) applies query-aware importance modulation to highlight relevant segments, and (3) adaptively compresses frames to preserve key content. Experimental validation reveals that SMORE achieves state-of-the-art performance on QVHighlights, Charades-STA, and ActivityNet-Captions benchmarks.
arXiv Detail & Related papers (2026-01-14T10:28:11Z) - VideoMem: Enhancing Ultra-Long Video Understanding via Adaptive Memory Management [17.645183933549458]
VideoMem is a novel framework that pioneers modeling long video understanding as a sequential generation task via adaptive memory management. We show that VideoMem significantly outperforms existing open-source models across diverse benchmarks for ultra-long video understanding tasks.
arXiv Detail & Related papers (2025-12-04T07:42:13Z) - VideoSSM: Autoregressive Long Video Generation with Hybrid State-Space Memory [42.2374676860638]
Autoregressive (AR) diffusion enables streaming, interactive long-video generation by producing frames causally. Maintaining coherence over minute-scale horizons remains challenging due to accumulated errors, motion drift, and content repetition. We propose VideoSSM, a Long Video Model that unifies AR diffusion with a hybrid state-space memory.
arXiv Detail & Related papers (2025-12-04T07:06:02Z) - WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning [66.24870234484668]
We introduce WorldMM, a novel multimodal memory agent that constructs and retrieves from multiple complementary memories. WorldMM significantly outperforms existing baselines across five long video question-answering benchmarks.
arXiv Detail & Related papers (2025-12-02T05:14:52Z) - video-SALMONN S: Streaming Audio-Visual LLMs Beyond Length Limits via Memory [51.03819128505358]
Video-SALMONN S is the first model to process 3-hour videos at 1 FPS and 360p resolution under a fixed memory budget. A test-time-training memory module continually updates token representations to capture long-range dependencies. A prompt-dependent memory reader retrieves context-relevant content from the fixed-size memory.
arXiv Detail & Related papers (2025-10-13T08:20:15Z) - mGRADE: Minimal Recurrent Gating Meets Delay Convolutions for Lightweight Sequence Modeling [0.5236468296934584]
mGRADE is a hybrid-memory system that integrates a temporal 1D-convolution with learnable spacings followed by a minimal gated recurrent unit. We demonstrate that mGRADE effectively separates and preserves multi-scale temporal features. This highlights mGRADE's promise as an efficient solution for memory-constrained multi-scale temporal processing at the edge.
arXiv Detail & Related papers (2025-07-02T15:44:35Z) - AdaReTaKe: Adaptive Redundancy Reduction to Perceive Longer for Video-language Understanding [55.320254859515714]
Multimodal Large Language Models (MLLMs) have revolutionized video understanding, yet are still limited by context length when processing long videos. We propose AdaReTaKe, a training-free method that flexibly reduces visual redundancy by allocating compression ratios among time and layers with theoretical guarantees. Experiments on VideoMME, MLVU, LongVideoBench, and LVBench datasets demonstrate that AdaReTaKe outperforms existing methods by 2.3% and 2.8% for 7B and 72B models, respectively.
arXiv Detail & Related papers (2025-03-16T16:14:52Z) - InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions [104.90258030688256]
This project introduces disentangled streaming perception, reasoning, and memory mechanisms, enabling real-time interaction with streaming video and audio input. It simulates human-like cognition, enabling multimodal large language models to provide continuous and adaptive service over time.
arXiv Detail & Related papers (2024-12-12T18:58:30Z) - Look Every Frame All at Once: Video-Ma$^2$mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing [52.050036778325094]
Video-Ma$^2$mba is a novel architecture that incorporates State Space Models (SSMs) within the Mamba-2 framework. Our approach significantly reduces the memory footprint compared to standard gradient checkpointing. By maintaining a detailed capture of temporal dynamics, our model improves the accuracy and relevance of responses in long video understanding tasks. (A minimal multi-axis checkpointing sketch follows this entry.)
arXiv Detail & Related papers (2024-11-29T04:12:13Z)
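As a hedged illustration of the multi-axis idea referenced in the Video-Ma$^2$mba entry, the sketch below checkpoints activations along both the layer axis and the time axis so that backward-pass memory scales with the chunk size rather than the full sequence length. The GRU stand-in, chunk size, and module names are assumptions for illustration; the paper itself builds on Mamba-2 state-space layers with its own checkpointing scheme rather than this generic PyTorch recipe.

```python
# Minimal sketch of checkpointing along both the layer and the time axis.
# Assumes PyTorch with non-reentrant (and nested) checkpointing support.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class ChunkedRecurrentLayer(nn.Module):
    """Recurrent layer whose forward is recomputed chunk-by-chunk in backward."""
    def __init__(self, dim: int, chunk: int = 128):
        super().__init__()
        self.cell = nn.GRU(dim, dim, batch_first=True)
        self.chunk = chunk

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        outs, h = [], None
        for start in range(0, x.size(1), self.chunk):
            seg = x[:, start:start + self.chunk]
            if h is None:
                y, h = checkpoint(self.cell, seg, use_reentrant=False)
            else:
                y, h = checkpoint(self.cell, seg, h, use_reentrant=False)
            outs.append(y)
        return torch.cat(outs, dim=1)

class CheckpointedStack(nn.Module):
    """Stack that also checkpoints each layer (the second axis)."""
    def __init__(self, dim: int, depth: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(ChunkedRecurrentLayer(dim) for _ in range(depth))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = checkpoint(layer, x, use_reentrant=False)
        return x

# Usage: activations for only one chunk of one layer are live at a time in backward.
model = CheckpointedStack(dim=64)
video_tokens = torch.randn(1, 1024, 64, requires_grad=True)
model(video_tokens).sum().backward()
```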