Memory-efficient Streaming VideoLLMs for Real-time Procedural Video Understanding
- URL: http://arxiv.org/abs/2504.13915v1
- Date: Thu, 10 Apr 2025 17:13:08 GMT
- Title: Memory-efficient Streaming VideoLLMs for Real-time Procedural Video Understanding
- Authors: Dibyadip Chatterjee, Edoardo Remelli, Yale Song, Bugra Tekin, Abhay Mittal, Bharat Bhatnagar, Necati Cihan Camgöz, Shreyas Hampali, Eric Sauser, Shugao Ma, Angela Yao, Fadime Sener
- Abstract summary: We introduce ProVideLLM, an end-to-end framework for real-time procedural video understanding. ProVideLLM integrates a multimodal cache configured to store two types of tokens. By interleaving these tokens in our multimodal cache, ProVideLLM ensures sub-linear scaling of memory and compute with video length.
- Score: 51.91097761028129
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce ProVideLLM, an end-to-end framework for real-time procedural video understanding. ProVideLLM integrates a multimodal cache configured to store two types of tokens - verbalized text tokens, which provide compressed textual summaries of long-term observations, and visual tokens, encoded with DETR-QFormer to capture fine-grained details from short-term observations. This design reduces token count by 22x over existing methods in representing one hour of long-term observations while effectively encoding fine-granularity of the present. By interleaving these tokens in our multimodal cache, ProVideLLM ensures sub-linear scaling of memory and compute with video length, enabling per-frame streaming inference at 10 FPS and streaming dialogue at 25 FPS, with a minimal 2GB GPU memory footprint. ProVideLLM also sets new state-of-the-art results on six procedural tasks across four datasets.
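The abstract describes the cache mechanics only at a high level, so below is a minimal sketch of the interleaved multimodal cache idea: recent frames stay as fine-grained visual tokens, and frames that age out of the short-term window are folded into verbalized text summaries. `MultimodalCache` and the `verbalize` callback are placeholder names of ours, not ProVideLLM's actual API.

```python
from collections import deque

import torch


class MultimodalCache:
    """Sketch of an interleaved multimodal cache: recent frames are kept as
    fine-grained visual tokens, while older observations are replaced by short
    verbalized text summaries, so memory grows sub-linearly with video length."""

    def __init__(self, short_term_frames: int = 16):
        self.short_term_frames = short_term_frames
        self.visual = deque()   # (frame_idx, [tokens_per_frame, dim]) tensors
        self.verbalized = []    # text summaries of long-term observations

    def add_frame(self, frame_tokens: torch.Tensor, frame_idx: int, verbalize):
        self.visual.append((frame_idx, frame_tokens))
        # Once the short-term window is full, fold the oldest frame into text.
        while len(self.visual) > self.short_term_frames:
            old_idx, old_tokens = self.visual.popleft()
            self.verbalized.append(verbalize(old_idx, old_tokens))

    def interleaved_context(self):
        # Long-term text summaries first, then short-term visual tokens.
        return self.verbalized, torch.cat([t for _, t in self.visual], dim=0)


cache = MultimodalCache()
describe = lambda idx, toks: f"frame {idx}: <summary of {toks.shape[0]} tokens>"
for i in range(40):
    cache.add_frame(torch.randn(32, 768), i, verbalize=describe)
texts, visual = cache.interleaved_context()  # 24 summaries, [16 * 32, 768] tokens
```

Because evicted frames are replaced by short text strings, the number of visual tokens in the cache is bounded regardless of video length, which is what makes the sub-linear memory and compute scaling plausible.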
Related papers
- Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation [38.256412418893554]
We develop ViLaMP, a hierarchical video-language model that processes hour-long videos at mixed precision.
ViLaMP achieves superior performance across four video understanding benchmarks, particularly on long-form content.
Notably, ViLaMP can process ultra-long videos (up to 10K frames) on a single NVIDIA A100 GPU.
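The summary does not say how the mixed precision is realized; one plausible reading, sketched below under that assumption, is that selected keyframes keep their full patch-token resolution while every other frame is pooled down to a single token:

```python
import torch


def mixed_precision_tokens(frame_feats: torch.Tensor, keyframe_mask: torch.Tensor):
    """frame_feats: [T, N, D] patch tokens per frame; keyframe_mask: [T] bools.
    Keyframes keep all N tokens; every other frame is mean-pooled to one token."""
    out = []
    for feats, is_key in zip(frame_feats, keyframe_mask):
        out.append(feats if is_key else feats.mean(dim=0, keepdim=True))
    return torch.cat(out, dim=0)  # far fewer than T * N tokens overall
```

Under such a scheme the token budget is dominated by the few keyframes, which is one way a model could fit 10K-frame videos on a single A100.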
arXiv Detail & Related papers (2025-04-03T09:55:09Z)
- VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers [23.541896057977745]
VideoScan is an efficient vision-language model (VLM) inference framework for real-time video interaction.
VideoScan employs a single semantic carrier token to represent each frame.
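As a rough illustration of a frame-level semantic carrier, the sketch below pools each frame's patch tokens into one token with learned attention pooling; this is our guess at a reasonable design, not VideoScan's actual module:

```python
import torch
import torch.nn as nn


class SemanticCarrier(nn.Module):
    """Compress each frame's patch tokens into a single carrier token
    via attention pooling with a learned query."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: [B, N, D] -> carrier: [B, 1, D], one token per frame.
        q = self.query.expand(patch_tokens.size(0), -1, -1)
        carrier, _ = self.attn(q, patch_tokens, patch_tokens)
        return carrier
```

One token per frame means the LLM context still grows with frame count, but with a very small constant, which is what enables real-time interaction.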
arXiv Detail & Related papers (2025-03-12T13:30:40Z)
- Everything Can Be Described in Words: A Simple Unified Multi-Modal Framework with Semantic and Temporal Alignment [0.0]
Long Video Question Answering (LVQA) is challenging due to the need for temporal reasoning and large-scale multimodal data processing.
We introduce UMaT, a retrieval-augmented generation framework that efficiently processes extremely long videos.
We show that UMaT outperforms existing methods in multimodal integration, long-form video understanding, and sparse information retrieval.
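Since UMaT verbalizes all modalities into text, the retrieval step reduces to similarity search over timestamped text segments. A minimal sketch, assuming an external embedding model has already produced the vectors:

```python
import torch
import torch.nn.functional as F


def retrieve_segments(query_vec, segment_vecs, segments, k=5):
    """Return the k text segments (e.g., timestamped captions + ASR lines)
    whose embeddings are most cosine-similar to the question embedding."""
    q = F.normalize(query_vec, dim=-1)         # [D]
    s = F.normalize(segment_vecs, dim=-1)      # [M, D]
    top = (s @ q).topk(min(k, len(segments))).indices
    return [segments[i] for i in top.tolist()]
```

The retrieved segments are then placed into the LLM prompt (presumably in temporal order); the sparse-retrieval claim above amounts to this step surfacing the few relevant lines out of hours of verbalized video.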
arXiv Detail & Related papers (2025-03-12T05:28:24Z)
- Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the LLM.
We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
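A sketch of the general pattern of a temporal encoder sitting between the image encoder and the LLM; we use a small transformer as a stand-in (STORM's actual encoder is its own design) and add temporal pooling to cut the token count:

```python
import torch
import torch.nn as nn


class TemporalEncoder(nn.Module):
    """Stand-in temporal module between the image encoder and the LLM:
    mixes information across frames, then pools along time to reduce tokens."""

    def __init__(self, dim: int = 768, pool: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.pool = pool

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [B, T, N, D] patch tokens; joint space-time attention for simplicity.
        B, T, N, D = x.shape
        assert T % self.pool == 0, "T must be divisible by the pooling factor"
        x = self.encoder(x.reshape(B, T * N, D)).reshape(B, T, N, D)
        # Average every `pool` consecutive frames after they have exchanged
        # information, so the pooled tokens already carry temporal context.
        x = x.reshape(B, T // self.pool, self.pool, N, D).mean(dim=2)
        return x.reshape(B, -1, D)  # [B, (T / pool) * N, D] tokens for the LLM
```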
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
- Look Every Frame All at Once: Video-Ma$^2$mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing [52.050036778325094]
Video-Ma$^2$mba is a novel architecture that incorporates State Space Models (SSMs) within the Mamba-2 framework.
Our approach significantly reduces the memory footprint compared to standard gradient checkpointing.
By maintaining a detailed capture of temporal dynamics, our model improves the accuracy and relevance of responses in long video understanding tasks.
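Multi-axis gradient checkpointing can be sketched as checkpointing along both the layer axis and the time axis. The toy version below applies standard activation checkpointing per layer while also processing frames in chunks; note it ignores the recurrent state a real SSM would carry between chunks:

```python
import torch
from torch.utils.checkpoint import checkpoint


def forward_multi_axis_ckpt(layers, frames: torch.Tensor, chunk: int = 64):
    """layers: sequence of modules mapping [B, T, D] -> [B, T, D].
    Checkpoint along depth (each layer) and time (each frame chunk), trading
    recomputation in the backward pass for a smaller activation footprint."""
    h = frames
    for layer in layers:                              # depth axis
        pieces = []
        for i in range(0, h.size(1), chunk):          # time axis
            # Caveat: a real SSM must also propagate its hidden state from
            # chunk to chunk; omitted here for brevity.
            pieces.append(checkpoint(layer, h[:, i:i + chunk], use_reentrant=False))
        h = torch.cat(pieces, dim=1)
    return h
```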
arXiv Detail & Related papers (2024-11-29T04:12:13Z)
- ReWind: Understanding Long Videos with Instructed Learnable Memory [8.002949551539297]
Vision-Language Models (VLMs) are crucial for applications requiring integrated understanding of textual and visual information.
We introduce ReWind, a novel memory-based VLM designed for efficient long video understanding while preserving temporal fidelity.
We empirically demonstrate ReWind's superior performance in visual question answering (VQA) and temporal grounding tasks, surpassing previous methods on long video benchmarks.
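A minimal sketch of a learnable-memory update of the kind the summary suggests: a fixed number of memory slots cross-attend to each incoming clip, so context size stays constant however long the video runs. Names and the update rule are our illustration, not ReWind's actual modules:

```python
import torch
import torch.nn as nn


class LearnableMemory(nn.Module):
    """Fixed-size learnable memory updated by cross-attending to incoming
    frame tokens; long videos are summarized without storing every frame."""

    def __init__(self, slots: int = 64, dim: int = 768):
        super().__init__()
        self.init_memory = nn.Parameter(torch.randn(1, slots, dim))
        self.write = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_tokens: torch.Tensor, mem=None) -> torch.Tensor:
        # frame_tokens: [B, N, D] for the current clip; mem carries over clips.
        if mem is None:
            mem = self.init_memory.expand(frame_tokens.size(0), -1, -1)
        update, _ = self.write(mem, frame_tokens, frame_tokens)
        return self.norm(mem + update)  # pass this back in for the next clip
```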
arXiv Detail & Related papers (2024-11-23T13:23:22Z)
- LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding [65.46303012350207]
LongVU is an adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos.
We leverage DINOv2 features to remove redundant frames that exhibit high similarity.
We perform spatial token reduction across frames based on their temporal dependencies.
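The frame-pruning half of this pipeline is easy to sketch: pool each frame's DINOv2 features, then keep a frame only if it differs enough from the last kept one. The threshold value here is an arbitrary placeholder:

```python
import torch
import torch.nn.functional as F


def drop_redundant_frames(dino_feats: torch.Tensor, threshold: float = 0.9):
    """dino_feats: [T, N, D] DINOv2 patch features per frame. Returns indices
    of kept frames; a frame is redundant if its pooled feature has cosine
    similarity above `threshold` with the most recently kept frame."""
    feats = F.normalize(dino_feats.mean(dim=1), dim=-1)  # [T, D] frame vectors
    keep = [0]
    for t in range(1, feats.size(0)):
        if float(feats[t] @ feats[keep[-1]]) < threshold:
            keep.append(t)
    return keep
```

Spatial token reduction (the second step above) would then operate only on the surviving frames.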
arXiv Detail & Related papers (2024-10-22T21:21:37Z)
- VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation [66.00245701441547]
We introduce a novel approach to reduce vision compute by letting redundant vision tokens skip layers rather than decreasing the number of vision tokens.
Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video.
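The mixture-of-depths idea for vision tokens can be sketched as a per-layer router: only the top-scoring fraction of vision tokens pass through the expensive block, while the rest ride the residual path unchanged. This is a simplification; among other things, real mixture-of-depths routing also weights the processed outputs by router scores so the router receives gradients:

```python
import torch
import torch.nn as nn


class MoDVisionLayer(nn.Module):
    """Route only the top-`keep_ratio` vision tokens through `block`;
    all other tokens skip the layer via the residual path."""

    def __init__(self, block: nn.Module, dim: int = 768, keep_ratio: float = 0.25):
        super().__init__()
        self.block, self.router = block, nn.Linear(dim, 1)
        self.keep_ratio = keep_ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [B, N, D] vision tokens.
        B, N, D = x.shape
        k = max(1, int(N * self.keep_ratio))
        idx = self.router(x).squeeze(-1).topk(k, dim=1).indices   # [B, k]
        gather_idx = idx.unsqueeze(-1).expand(-1, -1, D)
        picked = torch.gather(x, 1, gather_idx)
        out = x.clone()                                  # skipped tokens pass through
        out.scatter_(1, gather_idx, self.block(picked))  # processed tokens written back
        return out


layer = MoDVisionLayer(nn.Sequential(nn.Linear(768, 768), nn.GELU(), nn.Linear(768, 768)))
y = layer(torch.randn(2, 256, 768))  # only 64 of 256 tokens hit the block
```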
arXiv Detail & Related papers (2024-08-29T17:21:58Z)
- Streaming Long Video Understanding with Large Language Models [83.11094441893435]
VideoStreaming is an advanced vision-language large model (VLLM) for video understanding.
It capably understands arbitrary-length video with a constant number of video streaming tokens, which are encoded via memory propagation and adaptively selected.
Our model achieves superior performance and higher efficiency on long video benchmarks.
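A sketch of constant-size streaming encoding with memory propagation, which is how we read "a constant number of streaming tokens": each clip is summarized into the same fixed set of tokens, conditioned on the previous summary. Module names are our placeholders:

```python
import torch
import torch.nn as nn


class StreamingEncoder(nn.Module):
    """Encode a video clip-by-clip into a constant number of streaming tokens;
    each step attends to the previous summary plus the new clip (propagation)."""

    def __init__(self, num_tokens: int = 32, dim: int = 768):
        super().__init__()
        self.summary = nn.Parameter(torch.randn(1, num_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, clips) -> torch.Tensor:
        # clips: list of [B, N, D] token tensors, in temporal order.
        mem = self.summary.expand(clips[0].size(0), -1, -1)
        for clip in clips:
            kv = torch.cat([mem, clip], dim=1)  # previous summary + new clip
            mem, _ = self.attn(mem, kv, kv)     # output stays [B, num_tokens, D]
        return mem
```

However many clips arrive, the LLM only ever sees `num_tokens` video tokens, so memory and context stay constant with video length.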
arXiv Detail & Related papers (2024-05-25T02:22:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.