TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos
- URL: http://arxiv.org/abs/2504.17343v1
- Date: Thu, 24 Apr 2025 07:59:46 GMT
- Title: TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos
- Authors: Linli Yao, Yicheng Li, Yuancheng Wei, Lei Li, Shuhuai Ren, Yuanxin Liu, Kun Ouyang, Lean Wang, Shicheng Li, Sida Li, Lingpeng Kong, Qi Liu, Yuanxing Zhang, Xu Sun
- Abstract summary: TimeChat-Online is a novel online VideoLLM that revolutionizes real-time video interaction. Our Differential Token Drop (DTD) module addresses the challenge of visual redundancy in streaming videos. Our experiments demonstrate that DTD achieves an 82.8% reduction in video tokens while maintaining 98% performance on StreamingBench.
- Score: 47.91239059703758
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid growth of online video platforms, particularly live streaming services, has created an urgent need for real-time video understanding systems. These systems must process continuous video streams and respond to user queries instantaneously, presenting unique challenges for current Video Large Language Models (VideoLLMs). While existing VideoLLMs excel at processing complete videos, they face significant limitations in streaming scenarios due to their inability to handle dense, redundant frames efficiently. We introduce TimeChat-Online, a novel online VideoLLM that revolutionizes real-time video interaction. At its core lies our innovative Differential Token Drop (DTD) module, which addresses the fundamental challenge of visual redundancy in streaming videos. Drawing inspiration from human visual perception's Change Blindness phenomenon, DTD preserves meaningful temporal changes while filtering out static, redundant content between frames. Remarkably, our experiments demonstrate that DTD achieves an 82.8% reduction in video tokens while maintaining 98% performance on StreamingBench, revealing that over 80% of visual content in streaming videos is naturally redundant without requiring language guidance. To enable seamless real-time interaction, we present TimeChat-Online-139K, a comprehensive streaming video dataset featuring diverse interaction patterns including backward-tracing, current-perception, and future-responding scenarios. TimeChat-Online's unique Proactive Response capability, naturally achieved through continuous monitoring of video scene transitions via DTD, sets it apart from conventional approaches. Our extensive evaluation demonstrates that TimeChat-Online achieves superior performance on streaming benchmarks (StreamingBench and OvOBench) while maintaining competitive results on long-form video tasks such as Video-MME and MLVU.
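The abstract gives no implementation details, so the following is only a minimal sketch of the general idea behind DTD: compare each patch token of the current frame with the co-located patch of the previous frame and keep only those that changed noticeably. The function names, the cosine-similarity criterion, and the thresholds below are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn.functional as F

def differential_token_drop(prev_tokens: torch.Tensor,
                            curr_tokens: torch.Tensor,
                            sim_threshold: float = 0.9):
    """Keep only current-frame patch tokens that differ noticeably from the
    co-located patch of the previous frame (hypothetical criterion).

    prev_tokens, curr_tokens: (num_patches, dim) patch features of two
    consecutive frames. Returns the retained tokens and their indices.
    """
    # Cosine similarity between co-located patches of consecutive frames.
    sim = F.cosine_similarity(prev_tokens, curr_tokens, dim=-1)
    # A patch counts as "changed" (and is kept) if its similarity to the
    # previous frame falls below the threshold.
    kept_idx = (sim < sim_threshold).nonzero(as_tuple=True)[0]
    return curr_tokens[kept_idx], kept_idx

def stream_step(prev_tokens: torch.Tensor,
                curr_tokens: torch.Tensor,
                sim_threshold: float = 0.9,
                transition_ratio: float = 0.5):
    """One streaming step: drop redundant tokens and flag a possible scene
    transition when an unusually large fraction of patches change, which is
    one plausible trigger for a proactive response."""
    kept, idx = differential_token_drop(prev_tokens, curr_tokens, sim_threshold)
    change_ratio = kept.shape[0] / curr_tokens.shape[0]
    return kept, idx, change_ratio > transition_ratio
```

Under this reading, the reported 82.8% token reduction would correspond to keeping roughly 17% of patch tokens per frame on average; the actual retention criterion and any per-frame token budgeting used by the authors may differ.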
Related papers
- Learning Streaming Video Representation via Multitask Training [38.83730146750222]
Unlike offline video understanding, streaming video understanding requires the ability to process video streams frame by frame, preserve historical information, and make low-latency decisions.
We develop a novel streaming video backbone, termed StreamFormer, by incorporating causal temporal attention into a pre-trained vision transformer.
We conduct extensive experiments on online action detection, online video instance segmentation, and video question answering.
arXiv Detail & Related papers (2025-04-28T17:59:54Z)
- Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLMs.
We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
- $\infty$-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation [19.616624959353697]
$\infty$-Video can process arbitrarily long videos through a continuous-time long-term memory (LTM) consolidation mechanism. Our framework augments video Q-formers by allowing them to process video contexts efficiently and without requiring additional training.
arXiv Detail & Related papers (2025-01-31T12:45:46Z)
- Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge [57.01131456894516]
Current video understanding models struggle with processing long video sequences, supporting multi-turn dialogues, and adapting to real-world dynamic scenarios. We propose StreamChat, a training-free framework for streaming video reasoning and conversational interaction. Our framework incorporates a parallel system scheduling strategy that enhances processing speed and reduces latency, ensuring robust performance in real-world applications.
arXiv Detail & Related papers (2025-01-23T08:33:10Z)
- StreamChat: Chatting with Streaming Video [85.02875830683637]
StreamChat is a novel approach that enhances the interaction capabilities of Large Multimodal Models with streaming video content. We introduce a flexible and efficient cross-attention-based architecture to process dynamic streaming inputs. We construct a new dense instruction dataset to facilitate the training of streaming interaction models.
arXiv Detail & Related papers (2024-12-11T18:59:54Z)
- Towards Long Video Understanding via Fine-detailed Video Story Generation [58.31050916006673]
Long video understanding has become a critical task in computer vision, driving advancements across numerous applications from surveillance to content retrieval.
Existing video understanding methods suffer from two challenges when dealing with long video understanding: intricate long-context relationship modeling and interference from redundancy.
We introduce Fine-Detailed Video Story generation (FDVS), which interprets long videos into detailed textual representations.
arXiv Detail & Related papers (2024-12-09T03:41:28Z)
- Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams [78.72965584414368]
We present Flash-VStream, a video-language model that simulates the memory mechanism of human.
Compared to existing models, Flash-VStream achieves significant reductions in inference latency and VRAM consumption.
We propose VStream-QA, a novel question answering benchmark specifically designed for online video streaming understanding.
arXiv Detail & Related papers (2024-06-12T11:07:55Z)