Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams
- URL: http://arxiv.org/abs/2406.08085v2
- Date: Sun, 30 Jun 2024 05:39:46 GMT
- Title: Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams
- Authors: Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, Xiaojie Jin
- Abstract summary: We present Flash-VStream, a video-language model that simulates the memory mechanism of humans.
Compared to existing models, Flash-VStream achieves significant reductions in inference latency and VRAM consumption.
We propose VStream-QA, a novel question answering benchmark specifically designed for online video streaming understanding.
- Score: 78.72965584414368
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Benefiting from the advancements in large language models and cross-modal alignment, existing multi-modal video understanding methods have achieved prominent performance in offline scenarios. However, online video streams, one of the most common media forms in the real world, have seldom received attention. Compared to offline videos, the 'dynamic' nature of online video streams poses challenges for the direct application of existing models and introduces new problems, such as the storage of extremely long-term information and the interaction between continuous visual content and 'asynchronous' user questions. Therefore, in this paper we present Flash-VStream, a video-language model that simulates the memory mechanism of humans. Our model is able to process extremely long video streams in real time and respond to user queries simultaneously. Compared to existing models, Flash-VStream achieves significant reductions in inference latency and VRAM consumption, which is essential for understanding online streaming video. In addition, given that existing video understanding benchmarks predominantly concentrate on offline scenarios, we propose VStream-QA, a novel question answering benchmark specifically designed for online video streaming understanding. Comparisons with popular existing methods on the proposed benchmark demonstrate the superiority of our method in such a challenging setting. To verify the generalizability of our approach, we further evaluate it on existing video understanding benchmarks, where it achieves state-of-the-art performance in offline scenarios as well. All code, models, and datasets are available at https://invinciblewyq.github.io/vstream-page/
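The memory mechanism described in the abstract can be pictured with a minimal sketch: frames are encoded as they arrive, a fixed-size visual memory is consolidated online, and an asynchronous question only reads the current memory. Everything below (the function names, the MEMORY_SLOTS budget, and the pairwise-merge consolidation rule) is an illustrative assumption, not the released Flash-VStream implementation.

```python
import numpy as np

MEMORY_SLOTS = 64   # assumed memory budget, constant w.r.t. stream length
FEATURE_DIM = 256   # assumed per-frame feature size

def encode_frame(frame: np.ndarray) -> np.ndarray:
    """Stand-in for a visual encoder; returns one feature vector per frame."""
    rng = np.random.default_rng(abs(hash(frame.tobytes())) % (2**32))
    return rng.standard_normal(FEATURE_DIM).astype(np.float32)

class StreamingMemory:
    """Keeps at most MEMORY_SLOTS features by merging the two most similar
    entries whenever the budget is exceeded (a simple consolidation rule)."""

    def __init__(self) -> None:
        self.slots: list[np.ndarray] = []

    def update(self, feature: np.ndarray) -> None:
        self.slots.append(feature)
        if len(self.slots) > MEMORY_SLOTS:
            mat = np.stack(self.slots)
            sims = mat @ mat.T
            np.fill_diagonal(sims, -np.inf)
            i, j = np.unravel_index(np.argmax(sims), sims.shape)
            merged = (self.slots[i] + self.slots[j]) / 2.0
            for idx in sorted((int(i), int(j)), reverse=True):
                self.slots.pop(idx)
            self.slots.append(merged)

    def read(self) -> np.ndarray:
        return np.stack(self.slots)

def answer(question: str, memory: StreamingMemory) -> str:
    """Stand-in for the LLM decoder, which would condition on memory tokens."""
    return f"[answer to {question!r} from {len(memory.slots)} memory tokens]"

if __name__ == "__main__":
    memory = StreamingMemory()
    stream = (np.full((8, 8, 3), t % 256, dtype=np.uint8) for t in range(500))
    for t, frame in enumerate(stream):
        memory.update(encode_frame(frame))   # runs continuously per frame
        if t in (120, 480):                  # questions arrive asynchronously
            print(answer("what happened so far?", memory))
```

Because the memory never grows beyond its fixed budget, per-frame update cost and VRAM stay bounded regardless of stream length, which is the property the abstract attributes to the method.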
Related papers
- LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding [41.9477837230283]
LongVideoBench is a question-answering benchmark that features video-language interleaved inputs up to an hour long.
Our benchmark includes 3,763 varying-length web-collected videos with their subtitles across diverse themes.
We formulate a novel video question-answering task termed referring reasoning.
arXiv Detail & Related papers (2024-07-22T16:00:55Z) - VideoLLM-online: Online Video Large Language Model for Streaming Video [27.073238234038826]
We propose a novel Learning-In-Video-Stream framework, which enables temporally aligned, long-context, and real-time conversation within a continuous video stream.
Our framework supports streaming dialogue in a 5-minute video clip at over 10 FPS on an A100 GPU.
It also showcases state-of-the-art performance on public offline video benchmarks for tasks such as recognition, captioning, and forecasting.
arXiv Detail & Related papers (2024-06-17T17:55:32Z) - Streaming Long Video Understanding with Large Language Models [83.11094441893435]
VideoStreaming is an advanced vision-language large model (VLLM) for video understanding.
It understands arbitrary-length video with a constant number of video streaming tokens, which are encoded and adaptively propagated as memories.
Our model achieves superior performance and higher efficiency on long video benchmarks.
arXiv Detail & Related papers (2024-05-25T02:22:09Z) - LongVLM: Efficient Long Video Understanding via Large Language Models [55.813206751150716]
LongVLM is a simple yet powerful VideoLLM for long video understanding.
We encode video representations that incorporate both local and global information.
Our model produces more precise responses for long video understanding.
arXiv Detail & Related papers (2024-04-04T11:33:29Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - Temporal Sentence Grounding in Streaming Videos [60.67022943824329]
This paper aims to tackle a novel task - Temporal Sentence Grounding in Streaming Videos (TSGSV)
The goal of TSGSV is to evaluate the relevance between a video stream and a given sentence query.
We propose two novel methods: (1) a TwinNet structure that enables the model to learn about upcoming events; and (2) a language-guided feature compressor that eliminates redundant visual frames.
arXiv Detail & Related papers (2023-08-14T12:30:58Z) - OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation [75.07460026246582]
Referring video object segmentation (RVOS) aims at segmenting an object in a video following human instruction.
Current state-of-the-art methods fall into an offline pattern, in which each clip independently interacts with text embedding.
We propose a simple yet effective online model using explicit query propagation, named OnlineRefer.
arXiv Detail & Related papers (2023-07-18T15:43:35Z) - Online Streaming Video Super-Resolution with Convolutional Look-Up Table [26.628925884353674]
This paper focuses on the rarely explored problem setting of online streaming video super-resolution.
A new benchmark dataset named LDV-WebRTC is constructed based on a real-world online streaming system.
We propose a mixture-of-expert-LUT module, where a set of LUTs specialized for different degradations is built and adaptively combined to handle varying degradations.
arXiv Detail & Related papers (2023-03-01T08:54:56Z) - Two-Level Temporal Relation Model for Online Video Instance Segmentation [3.9349485816629888]
We propose an online method whose performance is on par with that of its offline counterparts.
We introduce a message-passing graph neural network that encodes objects and relates them through time.
Our model, trained end-to-end, achieves state-of-the-art performance on the YouTube-VIS dataset.
arXiv Detail & Related papers (2022-10-30T10:01:01Z) - InstanceFormer: An Online Video Instance Segmentation Framework [21.760243214387987]
We propose a single-stage transformer-based efficient online VIS framework named InstanceFormer.
We propose three novel components to model short-term and long-term dependencies and temporal coherence.
The proposed InstanceFormer outperforms previous online benchmark methods by a large margin across multiple datasets.
arXiv Detail & Related papers (2022-08-22T18:54:18Z)