Related papers: FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding

URL: http://arxiv.org/abs/2603.02096v1
Date: Mon, 02 Mar 2026 17:16:47 GMT
Title: FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding
Authors: Yiweng Xie, Bo He, Junke Wang, Xiangyu Zheng, Ziyi Ye, Zuxuan Wu,
Abstract summary: FluxMem adaptively compresses redundant visual memory through a hierarchical, two-stage design. It achieves new state-of-the-art results on existing online video benchmarks. It maintains strong offline performance, achieving 73.1 on MLVU while using 65% fewer visual tokens.
Score: 49.23912975740968
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper presents FluxMem, a training-free framework for efficient streaming video understanding. FluxMem adaptively compresses redundant visual memory through a hierarchical, two-stage design: (1) a Temporal Adjacency Selection (TAS) module removes redundant visual tokens across adjacent frames, and (2) a Spatial Domain Consolidation (SDC) module further merges spatially repetitive regions within each frame into compact representations. To adapt effectively to dynamic scenes, we introduce a self-adaptive token compression mechanism in both TAS and SDC, which automatically determines the compression rate based on intrinsic scene statistics rather than manual tuning. Extensive experiments demonstrate that FluxMem achieves new state-of-the-art results on existing online video benchmarks, reaching 76.4 on StreamingBench and 67.2 on OVO-Bench under real-time settings, while reducing latency by 69.9% and peak GPU memory by 34.5% on OVO-Bench. Furthermore, it maintains strong offline performance, achieving 73.1 on MLVU while using 65% fewer visual tokens.
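
The two-stage idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify how TAS and SDC score redundancy or how the adaptive rate is derived. The sketch assumes cosine similarity between token vectors and uses the mean similarity as a stand-in for "intrinsic scene statistics"; the function names `temporal_keep_indices` and `spatial_merge` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def temporal_keep_indices(prev_frame, cur_frame):
    """Assumed stand-in for temporal selection across adjacent frames.

    Compares each token of the current frame with the token at the same
    position in the previous frame and keeps only the tokens that changed
    more than average. The threshold (mean similarity) is an assumption,
    chosen so that no compression rate has to be set by hand.
    """
    sims = [cosine(p, c) for p, c in zip(prev_frame, cur_frame)]
    if not sims:
        return []
    threshold = sum(sims) / len(sims)
    return [i for i, s in enumerate(sims) if s < threshold]


def spatial_merge(tokens):
    """Assumed stand-in for spatial consolidation within one frame.

    Greedily groups tokens whose similarity to a group's centroid reaches
    the mean pairwise similarity of the frame, then returns one averaged
    token per group.
    """
    if len(tokens) < 2:
        return [list(t) for t in tokens]
    pair_sims = [
        cosine(tokens[i], tokens[j])
        for i in range(len(tokens))
        for j in range(i + 1, len(tokens))
    ]
    threshold = sum(pair_sims) / len(pair_sims)

    groups = []  # each group is a list of member tokens

    def centroid(group):
        return [sum(col) / len(group) for col in zip(*group)]

    for token in tokens:
        for group in groups:
            if cosine(token, centroid(group)) >= threshold:
                group.append(token)
                break
        else:
            groups.append([token])
    return [centroid(g) for g in groups]


if __name__ == "__main__":
    prev = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    cur = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
    kept = temporal_keep_indices(prev, cur)
    print("kept token indices:", kept)
    print("merged tokens:", spatial_merge([cur[i] for i in range(len(cur))]))
```

A mean-based threshold is only one possible statistic. It degenerates when every token changes by the same amount (nothing falls below the mean), so a practical system would need a more careful rule.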

Related papers

Event-Anchored Frame Selection for Effective Long-Video Understanding [67.56884568828508]
Event-Anchored Frame Selection (EFS) is a hierarchical, event-aware pipeline. As a training-free, plug-and-play module, EFS can be seamlessly integrated into off-the-shelf LVLMs.
arXiv Detail & Related papers (2026-03-01T08:25:37Z)
PackCache: A Training-Free Acceleration Method for Unified Autoregressive Video Generation via Compact KV-Cache [61.57938553036056]
We introduce PackCache, a training-free KV-cache management method that compacts the KV cache through three coordinated mechanisms. In terms of efficiency, PackCache accelerates end-to-end generation by 1.7-2.2x on 48-frame long sequences.
arXiv Detail & Related papers (2026-01-07T19:51:06Z)
Fast SAM2 with Text-Driven Token Pruning [52.8350457627401]
Segment Anything Model 2 (SAM2), a vision computation model, has significantly advanced prompt-driven video object segmentation. SAM2 pipelines propagate all visual tokens produced by the image encoder through downstream temporal reasoning modules, regardless of their relevance to the target object. We introduce a text-guided token pruning framework that improves inference efficiency by selectively reducing token density prior to temporal propagation.
arXiv Detail & Related papers (2025-12-24T18:59:05Z)
VideoCompressa: Data-Efficient Video Understanding via Joint Temporal Compression and Spatial Reconstruction [55.66673587952058]
Video understanding models are increasingly limited by the prohibitive storage and computational costs of large-scale datasets. VideoCompressa is a novel framework for video data synthesis that reframes the problem as dynamic latent compression.
arXiv Detail & Related papers (2025-11-24T07:07:58Z)
Self-Supervised Compression and Artifact Correction for Streaming Underwater Imaging Sonar [14.023965177100239]
Real-time imaging sonar has become an important tool for underwater monitoring in environments where optical sensing is unreliable. We present SCOPE, a self-supervised framework that jointly performs compression and artifact correction without clean-noise pairs or synthetic assumptions. SCOPE has been deployed for months in three Pacific Northwest rivers to support real-time salmon enumeration and environmental monitoring in the wild.
arXiv Detail & Related papers (2025-11-17T21:19:15Z)
StreamForest: Efficient Online Video Understanding with Persistent Event Memory [37.73273040737155]
StreamForest is designed for streaming video understanding. Fine-grained Spatiotemporal Window captures detailed short-term visual cues to improve current scene perception. OnlineIT significantly boosts MLLM performance in both real-time perception and future prediction.
arXiv Detail & Related papers (2025-09-29T14:53:57Z)
Scene-Aware Vectorized Memory Multi-Agent Framework with Cross-Modal Differentiated Quantization VLMs for Visually Impaired Assistance [4.6432462796838125]
This study proposes a cross-modal differentiated quantization framework for vision-language models (VLMs) and a scene-aware vectorized memory multi-agent system for visually impaired assistance. The modular framework implements differentiated processing strategies, reducing memory requirements from 38GB to 16GB while maintaining model performance.
arXiv Detail & Related papers (2025-08-25T16:32:32Z)
An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes [85.00111442236499]
This paper presents Quicksviewer, an LMM with a new perceiving paradigm that partitions a video of nonuniform density into varying cubes using Gumbel Softmax. We train the model from a language backbone through three progressive stages, each incorporating lengthy videos averaging 420s at 1fps, thanks to the perceiving efficiency. With only 0.8M total video-text samples for training, our model outperforms the direct baseline employing a fixed partitioning strategy by a maximum of 8.72 in accuracy.
arXiv Detail & Related papers (2025-04-21T17:57:21Z)
Memory-efficient Low-latency Remote Photoplethysmography through Temporal-Spatial State Space Duality [15.714133129768323]
ME-r is a memory-efficient algorithm built on temporal-spatial state space duality. It efficiently captures subtle periodic variations across facial frames while maintaining minimal computational overhead. Our solution enables real-time inference with only 3.6 MB memory usage and 9.46 ms latency.
arXiv Detail & Related papers (2025-04-02T14:34:04Z)

This list is automatically generated from the titles and abstracts of the papers on this site.