Streaming Video Instruction Tuning
- URL: http://arxiv.org/abs/2512.21334v1
- Date: Wed, 24 Dec 2025 18:59:36 GMT
- Title: Streaming Video Instruction Tuning
- Authors: Jiaer Xia, Peixian Chen, Mengdan Zhang, Xing Sun, Kaiyang Zhou
- Abstract summary: We present Streamo, a real-time streaming video LLM that serves as a general-purpose interactive assistant. Streamo performs a broad spectrum of streaming video tasks, including real-time narration, action understanding, event captioning, temporal event grounding, and time-sensitive question answering.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Streamo, a real-time streaming video LLM that serves as a general-purpose interactive assistant. Unlike existing online video models that focus narrowly on question answering or captioning, Streamo performs a broad spectrum of streaming video tasks, including real-time narration, action understanding, event captioning, temporal event grounding, and time-sensitive question answering. To develop such versatility, we construct Streamo-Instruct-465K, a large-scale instruction-following dataset tailored for streaming video understanding. The dataset covers diverse temporal contexts and multi-task supervision, enabling unified training across heterogeneous streaming tasks. After training end-to-end on the instruction-following dataset through a streamlined pipeline, Streamo exhibits strong temporal reasoning, responsive interaction, and broad generalization across a variety of streaming benchmarks. Extensive experiments show that Streamo bridges the gap between offline video perception models and real-time multimodal assistants, taking a step toward unified, intelligent video understanding in continuous video streams.
Related papers
- Tele-Omni: a Unified Multimodal Framework for Video Generation and Editing [93.8111348452324]
Tele-Omni is a unified framework for video generation and editing that follows multimodal instructions. It supports text-to-video generation, image-to-video generation, first-last-frame video generation, in-context video generation, and in-context video editing.
arXiv Detail & Related papers (2026-02-10T10:01:16Z)
- SneakPeek: Future-Guided Instructional Streaming Video Generation [27.65687424500011]
We introduce a pipeline for future-driven streaming instructional video generation, dubbed SneakPeek. Our approach introduces three key innovations to enhance consistency and controllability. Experimental results demonstrate that our method produces temporally coherent and semantically faithful instructional videos.
arXiv Detail & Related papers (2025-12-15T06:32:57Z)
- ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts [56.75723197779384]
ARC-Hunyuan-Video is a multimodal model that processes visual, audio, and textual signals end-to-end for structured comprehension. Our model is capable of multi-granularity timestamped video captioning and summarization, open-ended video question answering, temporal video grounding, and video reasoning.
arXiv Detail & Related papers (2025-07-28T15:52:36Z)
- Learning Streaming Video Representation via Multitask Training [38.83730146750222]
Unlike offline video understanding, streaming video understanding requires the ability to process video streams frame by frame, preserve historical information, and make low-latency decisions. We develop a novel streaming video backbone, termed StreamFormer, by incorporating causal temporal attention into a pre-trained vision transformer. We conduct extensive experiments on online action detection, online video instance segmentation, and video question answering.
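The causal temporal attention that StreamFormer adds to its vision transformer is only named in the abstract; the core idea (frame t attends to frames 0..t and never the future, so features can be computed online as frames arrive) can be sketched as follows. The function name and the toy per-frame feature vectors are illustrative assumptions, not the paper's implementation.

```python
import math

def causal_temporal_attention(frames):
    """Toy single-head attention over per-frame feature vectors.

    frames: list of equal-length feature vectors, one per time step.
    Frame t may only attend to frames 0..t (causal mask), which is what
    lets a streaming backbone update representations without peeking
    at future frames.
    """
    out = []
    for t, query in enumerate(frames):
        # Dot-product scores against past and current frames only.
        scores = [sum(q * k for q, k in zip(query, frames[s]))
                  for s in range(t + 1)]
        # Softmax over the visible prefix (numerically stabilized).
        m = max(scores)
        exp = [math.exp(s - m) for s in scores]
        z = sum(exp)
        weights = [e / z for e in exp]
        # Weighted sum of the visible frames' features.
        out.append([sum(w * frames[s][d] for s, w in enumerate(weights))
                    for d in range(len(query))])
    return out
```

Because frame 0 can only attend to itself, its output equals its input; later frames mix in history but never the future, which is the property that distinguishes a streaming backbone from an offline one.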
arXiv Detail & Related papers (2025-04-28T17:59:54Z)
- TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos [47.91239059703758]
TimeChat-Online is a novel online VideoLLM that revolutionizes real-time video interaction. Our Differential Token Drop (DTD) module addresses the challenge of visual redundancy in streaming videos. Our experiments demonstrate that DTD achieves an 82.8% reduction in video tokens while maintaining 98% performance on StreamingBench.
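The abstract only names the Differential Token Drop module; one plausible reading (drop a patch token when it has barely changed from the same-position token of the previous frame) can be sketched as below. The per-token L1 distance and the threshold `tau` are assumptions for illustration, not TimeChat-Online's actual criterion.

```python
def differential_token_drop(frames, tau=0.1):
    """Keep only visual tokens that changed noticeably between frames.

    frames: list of frames; each frame is a list of token feature
            vectors laid out in the same grid order.
    tau:    drop a token when its L1 distance to the same-position
            token of the previous frame is <= tau (assumed criterion).
    Returns (frame_index, token_index, token) triples that survive.
    """
    kept = []
    prev = None
    for f, tokens in enumerate(frames):
        for i, tok in enumerate(tokens):
            if prev is None:
                kept.append((f, i, tok))  # first frame: keep everything
                continue
            diff = sum(abs(a - b) for a, b in zip(tok, prev[i]))
            if diff > tau:
                kept.append((f, i, tok))  # token changed: keep it
        prev = tokens
    return kept
```

On a mostly static stream nearly everything after the first frame is dropped, which is the kind of redundancy the paper's 82.8% token-reduction figure points at.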
arXiv Detail & Related papers (2025-04-24T07:59:46Z)
- Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge [57.01131456894516]
Current video understanding models struggle with processing long video sequences, supporting multi-turn dialogues, and adapting to real-world dynamic scenarios. We propose StreamChat, a training-free framework for streaming video reasoning and conversational interaction. Our framework incorporates a parallel system scheduling strategy that enhances processing speed and reduces latency, ensuring robust performance in real-world applications.
arXiv Detail & Related papers (2025-01-23T08:33:10Z)
- StreamChat: Chatting with Streaming Video [85.02875830683637]
StreamChat is a novel approach that enhances the interaction capabilities of Large Multimodal Models with streaming video content. We introduce a flexible and efficient cross-attention-based architecture to process dynamic streaming inputs. We construct a new dense instruction dataset to facilitate the training of streaming interaction models.
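The cross-attention design is described in only one sentence; the general mechanism it refers to (text-token queries attending over whatever visual tokens have arrived so far in the stream) is sketched below, with all names and shapes chosen for illustration rather than taken from StreamChat.

```python
import math

def cross_attention(text_queries, visual_tokens):
    """Toy single-head cross-attention: each text query attends over the
    visual tokens currently available in the stream (keys == values here
    for brevity)."""
    out = []
    for q in text_queries:
        scores = [sum(a * b for a, b in zip(q, v)) for v in visual_tokens]
        m = max(scores)
        exp = [math.exp(s - m) for s in scores]
        z = sum(exp)
        w = [e / z for e in exp]
        out.append([sum(wi * v[d] for wi, v in zip(w, visual_tokens))
                    for d in range(len(q))])
    return out
```

As new frames arrive, only the `visual_tokens` side grows while the text queries are untouched, which is what makes cross-attention a natural fit for dynamic streaming inputs.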
arXiv Detail & Related papers (2024-12-11T18:59:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.