Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge
- URL: http://arxiv.org/abs/2501.13468v1
- Date: Thu, 23 Jan 2025 08:33:10 GMT
- Title: Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge
- Authors: Haomiao Xiong, Zongxin Yang, Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Jiawen Zhu, Huchuan Lu
- Abstract summary: Current video understanding models struggle with processing long video sequences, supporting multi-turn dialogues, and adapting to real-world dynamic scenarios.
We propose StreamChat, a training-free framework for streaming video reasoning and conversational interaction.
Our framework incorporates a parallel system scheduling strategy that enhances processing speed and reduces latency, ensuring robust performance in real-world applications.
- Score: 57.01131456894516
- Abstract: Recent advances in Large Language Models (LLMs) have enabled the development of Video-LLMs, advancing multimodal learning by bridging video data with language tasks. However, current video understanding models struggle with processing long video sequences, supporting multi-turn dialogues, and adapting to real-world dynamic scenarios. To address these issues, we propose StreamChat, a training-free framework for streaming video reasoning and conversational interaction. StreamChat leverages a novel hierarchical memory system to efficiently process and compress video features over extended sequences, enabling real-time, multi-turn dialogue. Our framework incorporates a parallel system scheduling strategy that enhances processing speed and reduces latency, ensuring robust performance in real-world applications. Furthermore, we introduce StreamBench, a versatile benchmark that evaluates streaming video understanding across diverse media types and interactive scenarios, including multi-turn interactions and complex reasoning tasks. Extensive evaluations on StreamBench and other public benchmarks demonstrate that StreamChat significantly outperforms existing state-of-the-art models in terms of accuracy and response times, confirming its effectiveness for streaming video understanding. Code is available at StreamChat: https://github.com/hmxiong/StreamChat.
Related papers
- SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding [56.78088668917983]
We introduce SVBench, a pioneering benchmark with temporal multi-turn question-answering chains.
We design a semi-automated annotation pipeline to obtain 49,979 Question-Answer (QA) pairs of 1,353 streaming videos.
Our experimental results, obtained from 14 models in dialogue and streaming evaluations, reveal that while the closed-source GPT-4o outperforms others, most open-source LVLMs struggle with long-context streaming video understanding.
arXiv Detail & Related papers (2025-02-15T14:29:44Z)
- StreamChat: Chatting with Streaming Video [85.02875830683637]
StreamChat is a novel approach that enhances the interaction capabilities of Large Multimodal Models with streaming video content.
We introduce a flexible and efficient cross-attention-based architecture to process dynamic streaming inputs.
We construct a new dense instruction dataset to facilitate the training of streaming interaction models.
arXiv Detail & Related papers (2024-12-11T18:59:54Z)
- OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding [34.17871202332497]
OmChat is a model designed to excel in handling long contexts and video understanding tasks.
It uses a dynamic vision encoding process to effectively handle images of various resolutions, capturing fine details across a range of image qualities.
With support for a context length of up to 512K, OmChat demonstrates promising performance in tasks involving multiple images and videos.
arXiv Detail & Related papers (2024-07-06T02:16:10Z)
- VideoLLM-online: Online Video Large Language Model for Streaming Video [27.073238234038826]
We propose a novel Learning-In-Video-Stream framework, which enables temporally aligned, long-context, and real-time conversation within a continuous video stream.
Our framework supports streaming dialogue in a 5-minute video clip at over 10 FPS on an A100 GPU.
It also showcases state-of-the-art performance on public offline video benchmarks, such as recognition, captioning, and forecasting.
arXiv Detail & Related papers (2024-06-17T17:55:32Z)
- Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams [78.72965584414368]
We present Flash-VStream, a video-language model that simulates the memory mechanism of humans.
Compared to existing models, Flash-VStream achieves significant reductions in inference latency and VRAM consumption.
We propose VStream-QA, a novel question answering benchmark specifically designed for online video streaming understanding.
arXiv Detail & Related papers (2024-06-12T11:07:55Z)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging because of the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is capable of both comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- MVBench: A Comprehensive Multi-modal Video Understanding Benchmark [63.14000659130736]
We introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench.
We first introduce a novel static-to-dynamic method to define these temporal-related tasks.
Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task.
arXiv Detail & Related papers (2023-11-28T17:59:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.