Learning Streaming Video Representation via Multitask Training
- URL: http://arxiv.org/abs/2504.20041v1
- Date: Mon, 28 Apr 2025 17:59:54 GMT
- Title: Learning Streaming Video Representation via Multitask Training
- Authors: Yibin Yan, Jilan Xu, Shangzhe Di, Yikun Liu, Yudi Shi, Qirui Chen, Zeqian Li, Yifei Huang, Weidi Xie
- Abstract summary: Unlike offline video understanding, streaming video understanding requires the ability to process video streams frame by frame, preserve historical information, and make low-latency decisions. We develop a novel streaming video backbone, termed StreamFormer, by incorporating causal temporal attention into a pre-trained vision transformer. We conduct extensive experiments on online action detection, online video instance segmentation, and video question answering.
- Score: 38.83730146750222
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding continuous video streams plays a fundamental role in real-time applications including embodied AI and autonomous driving. Unlike offline video understanding, streaming video understanding requires the ability to process video streams frame by frame, preserve historical information, and make low-latency decisions. To address these challenges, our main contributions are three-fold. (i) We develop a novel streaming video backbone, termed StreamFormer, by incorporating causal temporal attention into a pre-trained vision transformer. This enables efficient streaming video processing while maintaining image representation capability. (ii) To train StreamFormer, we propose to unify diverse spatial-temporal video understanding tasks within a multitask visual-language alignment framework. Hence, StreamFormer learns global semantics, temporal dynamics, and fine-grained spatial relationships simultaneously. (iii) We conduct extensive experiments on online action detection, online video instance segmentation, and video question answering. StreamFormer achieves competitive results while maintaining efficiency, demonstrating its potential for real-time applications.
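For concreteness, here is a minimal sketch of what causal temporal attention on top of a pre-trained vision transformer could look like. It is an illustrative PyTorch-style module based only on the abstract; the tensor layout, hyperparameters, and residual placement are assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class CausalTemporalAttention(nn.Module):
    """Temporal attention across frames with a causal mask, applied per spatial token.

    Hypothetical sketch: frame t attends only to frames <= t, so the backbone can
    process a stream frame by frame while the per-frame (image) features are kept
    intact through the residual connection.
    """

    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, D) = batch, frames, spatial tokens per frame, channels
        B, T, N, D = x.shape
        h = x.permute(0, 2, 1, 3).reshape(B * N, T, D)   # attend over time per spatial location
        hn = self.norm(h)
        # Boolean mask: True marks future frames, which may not be attended to.
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        out, _ = self.attn(hn, hn, hn, attn_mask=causal)
        out = out.reshape(B, N, T, D).permute(0, 2, 1, 3)
        return x + out                                    # residual keeps image representation capability

# Example: 8 frames of 196 patch tokens from a (frozen or fine-tuned) ViT.
tokens = torch.randn(2, 8, 196, 768)
y = CausalTemporalAttention()(tokens)  # same shape; frame t only saw frames <= t
```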
Related papers
- TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos [47.91239059703758]
TimeChat-Online is a novel online VideoLLM that revolutionizes real-time video interaction. Our Differential Token Drop (DTD) module addresses the challenge of visual redundancy in streaming videos. Our experiments demonstrate that DTD achieves an 82.8% reduction in video tokens while maintaining 98% performance on StreamingBench.
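The summary does not say how DTD decides which tokens are redundant. The snippet below is only a hypothetical illustration of the general idea (dropping patch tokens that barely change between consecutive frames); the similarity measure and threshold are assumptions, not the paper's rule.

```python
import torch

def drop_static_tokens(prev_tokens: torch.Tensor,
                       curr_tokens: torch.Tensor,
                       threshold: float = 0.95):
    """Keep only current-frame patch tokens that differ enough from the previous frame.

    prev_tokens, curr_tokens: (N, D) patch tokens at matching spatial positions.
    Returns the kept tokens and their indices. Purely illustrative; TimeChat-Online's
    actual DTD criterion may differ.
    """
    sim = torch.cosine_similarity(prev_tokens, curr_tokens, dim=-1)  # (N,)
    keep = sim < threshold  # a token counts as redundant if it is almost unchanged
    return curr_tokens[keep], keep.nonzero(as_tuple=True)[0]
```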
arXiv Detail & Related papers (2025-04-24T07:59:46Z)
- Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge [57.01131456894516]
Current video understanding models struggle with processing long video sequences, supporting multi-turn dialogues, and adapting to real-world dynamic scenarios. We propose StreamChat, a training-free framework for streaming video reasoning and conversational interaction. Our framework incorporates a parallel system scheduling strategy that enhances processing speed and reduces latency, ensuring robust performance in real-world applications.
arXiv Detail & Related papers (2025-01-23T08:33:10Z)
- StreamChat: Chatting with Streaming Video [85.02875830683637]
StreamChat is a novel approach that enhances the interaction capabilities of Large Multimodal Models with streaming video content. We introduce a flexible and efficient cross-attention-based architecture to process dynamic streaming inputs. We construct a new dense instruction dataset to facilitate the training of streaming interaction models.
arXiv Detail & Related papers (2024-12-11T18:59:54Z)
- Time Is MattEr: Temporal Self-supervision for Video Transformers [72.42240984211283]
We design simple yet effective self-supervised tasks for video models to learn temporal dynamics better.
Our method learns the temporal order of video frames as extra self-supervision and enforces the randomly shuffled frames to have low-confidence outputs.
Under various video action recognition tasks, we demonstrate the effectiveness of our method and its compatibility with state-of-the-art Video Transformers.
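A rough sketch of how the two objectives above could be combined in PyTorch follows; the permutation-classification setup and the uniform-target term for shuffled clips are assumptions based on this summary, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def temporal_ssl_losses(logits_order, order_target, logits_action_shuffled):
    """Illustrative combination of the two objectives described above.

    logits_order: order-prediction logits (e.g., which frame permutation was applied).
    order_target: index of the true permutation.
    logits_action_shuffled: action logits for the shuffled clip, which should stay low-confidence.
    """
    # Learn temporal order as extra self-supervision.
    order_loss = F.cross_entropy(logits_order, order_target)
    # Push the shuffled clip's prediction toward the uniform distribution (maximum entropy).
    log_probs = F.log_softmax(logits_action_shuffled, dim=-1)
    uniform = torch.full_like(log_probs, 1.0 / log_probs.size(-1))
    debias_loss = F.kl_div(log_probs, uniform, reduction="batchmean")
    return order_loss + debias_loss
```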
arXiv Detail & Related papers (2022-07-19T04:44:08Z)
- Self-supervised Spatiotemporal Representation Learning by Exploiting Video Continuity [15.429045937335236]
This work exploits an essential yet under-explored property of videos, i.e., video continuity, to obtain supervision signals for self-supervised representation learning.
We formulate three novel continuity-related pretext tasks, i.e. continuity justification, discontinuity localization, and missing section approximation.
This self-supervision approach, termed the Continuity Perception Network (CPNet), solves the three tasks altogether and encourages the backbone network to learn local and long-ranged motion and context representations.
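Purely as an illustration of this multitask setup, here is a sketch of three heads over a shared backbone feature; the head shapes and output spaces are assumptions, not CPNet's actual design.

```python
import torch.nn as nn

class ContinuityHeads(nn.Module):
    """Three illustrative heads over a shared clip feature z of shape (B, D):
    (1) continuity justification: is the clip continuous or does it skip a section?
    (2) discontinuity localization: at which frame boundary does the skip occur?
    (3) missing section approximation: regress a feature summarizing the missing section.
    """

    def __init__(self, dim: int = 512, num_boundaries: int = 15):
        super().__init__()
        self.justify = nn.Linear(dim, 2)                # continuous vs. discontinuous
        self.localize = nn.Linear(dim, num_boundaries)  # which boundary was cut
        self.approximate = nn.Linear(dim, dim)          # feature of the missing section

    def forward(self, z):
        return self.justify(z), self.localize(z), self.approximate(z)
```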
arXiv Detail & Related papers (2021-12-11T00:35:27Z)
- Self-Supervised MultiModal Versatile Networks [76.19886740072808]
We learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams.
We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks.
arXiv Detail & Related papers (2020-06-29T17:50:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.