StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant
- URL: http://arxiv.org/abs/2505.05467v1
- Date: Thu, 08 May 2025 17:57:40 GMT
- Title: StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant
- Authors: Haibo Wang, Bo Feng, Zhengfeng Lai, Mingze Xu, Shiyu Li, Weifeng Ge, Afshin Dehghan, Meng Cao, Ping Huang
- Abstract summary: We present StreamBridge, a framework that seamlessly transforms offline Video-LLMs into streaming-capable models. It addresses two fundamental challenges in adapting existing models to online scenarios: (1) limited capability for multi-turn real-time understanding, and (2) lack of proactive response mechanisms.
- Score: 35.00546613484669
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present StreamBridge, a simple yet effective framework that seamlessly transforms offline Video-LLMs into streaming-capable models. It addresses two fundamental challenges in adapting existing models to online scenarios: (1) limited capability for multi-turn real-time understanding, and (2) lack of proactive response mechanisms. Specifically, StreamBridge incorporates (1) a memory buffer combined with a round-decayed compression strategy, supporting long-context multi-turn interactions, and (2) a decoupled, lightweight activation model that can be effortlessly integrated into existing Video-LLMs, enabling continuous proactive responses. To further support StreamBridge, we construct Stream-IT, a large-scale dataset tailored for streaming video understanding, featuring interleaved video-text sequences and diverse instruction formats. Extensive experiments show that StreamBridge significantly improves the streaming understanding capabilities of offline Video-LLMs across various tasks, outperforming even proprietary models such as GPT-4o and Gemini 1.5 Pro. Simultaneously, it achieves competitive or superior performance on standard video understanding benchmarks.
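As a rough illustration of the two mechanisms the abstract names, the sketch below implements a per-round memory buffer whose older rounds are subsampled more aggressively ("round-decayed compression"), plus a trivial activation gate that decides when the main model should respond. This is a minimal sketch, not the authors' implementation: the class names, token budget, decay schedule, and threshold are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Round:
    """One dialogue round: its visual tokens plus the associated text."""
    frame_tokens: list
    text: str

@dataclass
class MemoryBuffer:
    """Hypothetical round-decayed memory buffer (names and budget assumed)."""
    rounds: list = field(default_factory=list)
    budget: int = 4096  # total visual-token budget (assumed value)

    def add_round(self, rnd: Round) -> None:
        self.rounds.append(rnd)
        self._compress()

    def _total_tokens(self) -> int:
        return sum(len(r.frame_tokens) for r in self.rounds)

    def _compress(self) -> None:
        # Round-decayed compression: keep the newest round intact and
        # subsample older rounds with a stride that grows with age,
        # repeating until the total token count fits the budget.
        while len(self.rounds) > 1 and self._total_tokens() > self.budget:
            before = self._total_tokens()
            for age, idx in enumerate(range(len(self.rounds) - 2, -1, -1), start=1):
                stride = 1 + age  # older rounds get larger subsampling strides
                self.rounds[idx].frame_tokens = self.rounds[idx].frame_tokens[::stride]
            if self._total_tokens() == before:  # nothing left to drop; give up
                break

def should_respond(activation_score: float, threshold: float = 0.5) -> bool:
    # Decoupled activation gate: a small scorer runs per incoming frame,
    # and the heavyweight Video-LLM only decodes a response when the
    # score crosses the threshold. The 0.5 value is an assumption.
    return activation_score > threshold
```

In a streaming loop, `should_respond` would be evaluated on every incoming frame, while `MemoryBuffer.add_round` would be called once per completed dialogue round, keeping recent rounds at full fidelity and older ones progressively coarser.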
Related papers
- Seedance 1.0: Exploring the Boundaries of Video Generation Models [71.26796999246068]
Seedance 1.0 is a high-performance and inference-efficient video foundation generation model. It integrates multi-source curated data augmented with precise and meaningful video captioning. Seedance 1.0 can generate a 5-second video at 1080p resolution in only 41.4 seconds (on an NVIDIA L20).
arXiv Detail & Related papers (2025-06-10T17:56:11Z)
- LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval [13.891391928767195]
LiveVLM is a training-free framework specifically designed for streaming, online video understanding and real-time interaction. LiveVLM constructs a streaming-oriented KV cache to process video streams in real time, retain long-term video details, and eliminate redundant KVs. When a new question is posed, LiveVLM incorporates an online question-answering process that efficiently fetches both short-term and long-term visual information.
arXiv Detail & Related papers (2025-05-21T08:47:15Z)
- Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge [57.01131456894516]
Current video understanding models struggle with processing long video sequences, supporting multi-turn dialogues, and adapting to real-world dynamic scenarios. We propose StreamChat, a training-free framework for streaming video reasoning and conversational interaction. Our framework incorporates a parallel system scheduling strategy that enhances processing speed and reduces latency, ensuring robust performance in real-world applications.
arXiv Detail & Related papers (2025-01-23T08:33:10Z)
- StreamChat: Chatting with Streaming Video [85.02875830683637]
StreamChat is a novel approach that enhances the interaction capabilities of Large Multimodal Models with streaming video content. We introduce a flexible and efficient cross-attention-based architecture to process dynamic streaming inputs. We construct a new dense instruction dataset to facilitate the training of streaming interaction models.
arXiv Detail & Related papers (2024-12-11T18:59:54Z)
- When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding [118.72266141321647]
Cross-Modality Video Coding (CMVC) is a pioneering approach to exploring multimodal representations and video generative models in video coding. During decoding, previously encoded components and video generation models are leveraged to create multiple encoding-decoding modes. Experiments indicate that TT2V achieves effective semantic reconstruction, while IT2V exhibits competitive perceptual consistency.
arXiv Detail & Related papers (2024-08-15T11:36:18Z)
- VideoLLM-online: Online Video Large Language Model for Streaming Video [27.073238234038826]
We propose a novel Learning-In-Video-Stream framework, which enables temporally aligned, long-context, and real-time conversation within a continuous video stream.
Our framework supports streaming dialogue in a 5-minute video clip at over 10 FPS on an A100 GPU.
It also showcases state-of-the-art performance on public offline video benchmarks, such as recognition, captioning, and forecasting.
arXiv Detail & Related papers (2024-06-17T17:55:32Z)
- Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams [78.72965584414368]
We present Flash-VStream, a video-language model that simulates the human memory mechanism.
Compared to existing models, Flash-VStream achieves significant reductions in inference latency and VRAM consumption.
We propose VStream-QA, a novel question answering benchmark specifically designed for online video streaming understanding.
arXiv Detail & Related papers (2024-06-12T11:07:55Z)
- StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text [58.49820807662246]
We introduce StreamingT2V, an autoregressive approach for long video generation of 80, 240, 600, 1200 or more frames with smooth transitions. Our code will be available at: https://github.com/Picsart-AI-Research/StreamingT2V.
arXiv Detail & Related papers (2024-03-21T18:27:29Z)