Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
- URL: http://arxiv.org/abs/2501.03218v1
- Date: Mon, 06 Jan 2025 18:55:10 GMT
- Title: Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
- Authors: Rui Qian, Shuangrui Ding, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang
- Abstract summary: We present Dispider, a system that disentangles Perception, Decision, and Reaction. Experiments show that Dispider not only maintains strong performance in conventional video QA tasks, but also significantly surpasses previous online models in streaming scenario responses.
- Score: 81.34648970317383
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Active real-time interaction with video LLMs introduces a new paradigm for human-computer interaction, where the model not only understands user intent but also responds while continuously processing streaming video on the fly. Unlike offline video LLMs, which analyze the entire video before answering questions, active real-time interaction requires three capabilities: 1) Perception: real-time video monitoring and interaction capturing; 2) Decision: raising proactive interactions in appropriate situations; 3) Reaction: continuous interaction with users. However, inherent conflicts exist among these capabilities: Decision and Reaction require Perception at opposing scales and granularities, and autoregressive decoding blocks real-time Perception and Decision during the Reaction. To unify the conflicting capabilities within a harmonious system, we present Dispider, a system that disentangles Perception, Decision, and Reaction. Dispider features a lightweight proactive streaming video processing module that tracks the video stream and identifies optimal moments for interaction. Once an interaction is triggered, an asynchronous interaction module provides detailed responses while the processing module continues to monitor the video. This disentangled, asynchronous design ensures timely, contextually accurate, and computationally efficient responses, making Dispider well suited to active real-time interaction over long-duration video streams. Experiments show that Dispider not only maintains strong performance on conventional video QA tasks but also significantly surpasses previous online models in streaming-scenario responses, validating the effectiveness of the architecture. The code and model are released at https://github.com/Mark12Ding/Dispider.
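The disentangled, asynchronous design described in the abstract can be pictured as a lightweight perception-and-decision loop that hands triggered interactions to a separate reaction worker, so decoding never blocks monitoring. The following is a minimal, hypothetical sketch of that control flow, not Dispider's released code; `should_trigger` and `generate_response` are stand-ins for the paper's streaming processing module and interaction module.

```python
import queue
import threading
import time

# Hypothetical stand-ins for the two disentangled parts: a lightweight
# trigger decision and a heavy autoregressive responder.
def should_trigger(frame_index, question):
    """Decision: return True when the stream warrants a proactive response."""
    return frame_index % 30 == 0  # toy rule; the paper uses a learned module

def generate_response(context_frames, question):
    """Reaction: heavy decoding, simulated here with a delay."""
    time.sleep(0.2)
    return f"answer to {question!r} using {len(context_frames)} frames"

def reaction_worker(trigger_q):
    """Asynchronous Reaction: consumes triggers without blocking Perception."""
    while True:
        item = trigger_q.get()
        if item is None:
            break
        context, question = item
        print(generate_response(context, question))

trigger_q = queue.Queue()
worker = threading.Thread(target=reaction_worker, args=(trigger_q,), daemon=True)
worker.start()

# Perception + Decision loop: keeps monitoring the stream even while the
# reaction thread is still decoding a previous answer.
history, question = [], "What is happening now?"
for frame_index in range(120):                    # stand-in for a live stream
    history.append(frame_index)                   # Perception: accumulate context
    if should_trigger(frame_index, question):     # Decision: proactive trigger
        trigger_q.put((list(history), question))  # Reaction handled asynchronously

trigger_q.put(None)
worker.join()
```

In the actual system the trigger decision is learned from the video stream and the query rather than being a fixed rule, and the responder is a video LLM; the sketch only illustrates the non-blocking hand-off between the two modules.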
Related papers
- StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition [19.54521322177521]
We introduce StreamMind, a video LLM framework that achieves ultra-FPS streaming video processing (100 fps on a single A100).
We propose a novel perception-cognition intertemporal paradigm named "event-gated LLM invocation".
Experiments on Ego4D and SoccerNet streaming tasks, as well as standard offline benchmarks, demonstrate state-of-the-art performance in both model capability and real-time efficiency.
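The "event-gated LLM invocation" idea can be illustrated, under assumptions, as a cheap per-frame gate that decides when the expensive language model is called at all, so the perception path keeps pace with the frame rate. The sketch below is hypothetical and not StreamMind's implementation; `cheap_event_score` and the threshold are purely illustrative.

```python
import numpy as np

# Hypothetical sketch of an event-gated invocation policy: a cheap per-frame
# gate decides whether the expensive LLM is called, so perception can keep
# up with the full frame rate while cognition fires only on events.
def cheap_event_score(frame_feat, prev_feat):
    """Toy gate signal: feature change between consecutive frames."""
    return float(np.linalg.norm(frame_feat - prev_feat))

def expensive_llm_call(t):
    return f"LLM invoked at frame {t}"

rng = np.random.default_rng(0)
scene = np.zeros(512)
prev = scene.copy()
for t in range(300):                              # simulated 300-frame stream
    if t % 100 == 0:                              # a scene change every 100 frames
        scene = rng.normal(size=512)
    feat = scene + 0.01 * rng.normal(size=512)    # small per-frame jitter
    if cheap_event_score(feat, prev) > 5.0:       # assumed gating threshold
        print(expensive_llm_call(t))              # cognition fires on events only
    prev = feat
```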
arXiv Detail & Related papers (2025-03-08T13:44:38Z) - Ready-to-React: Online Reaction Policy for Two-Character Interaction Generation [82.73098356401725]
We propose an online reaction policy, called Ready-to-React, to generate the next character pose based on past observed motions.
Each character has its own reaction policy as its "brain", enabling them to interact like real humans in a streaming manner.
Our approach can be controlled by sparse signals, making it well-suited for VR and other online interactive environments.
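A minimal sketch of what an online reaction policy can look like, assuming a fixed pose dimension and history window (both hypothetical): each character autoregressively predicts its next pose from both characters' recent motion, one frame at a time, which is what makes streaming two-character generation possible. The learned network is replaced here by a toy linear map.

```python
import numpy as np

POSE_DIM, WINDOW = 24, 16          # assumed pose size and history window

def reaction_policy(own_hist, other_hist, weights):
    """Toy linear policy standing in for a learned reaction network."""
    context = np.concatenate([own_hist[-WINDOW:].ravel(),
                              other_hist[-WINDOW:].ravel()])
    return weights @ context       # next pose for this character

rng = np.random.default_rng(0)
w_a = rng.normal(scale=0.01, size=(POSE_DIM, 2 * WINDOW * POSE_DIM))
w_b = rng.normal(scale=0.01, size=(POSE_DIM, 2 * WINDOW * POSE_DIM))
hist_a = rng.normal(size=(WINDOW, POSE_DIM))   # each character's past motion
hist_b = rng.normal(size=(WINDOW, POSE_DIM))

for step in range(100):            # streaming generation, one pose per step
    next_a = reaction_policy(hist_a, hist_b, w_a)   # A reacts to B's motion
    next_b = reaction_policy(hist_b, hist_a, w_b)   # B reacts to A's motion
    hist_a = np.vstack([hist_a, next_a])[-WINDOW:]
    hist_b = np.vstack([hist_b, next_b])[-WINDOW:]
```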
arXiv Detail & Related papers (2025-02-27T18:40:30Z) - Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge [57.01131456894516]
Current video understanding models struggle with processing long video sequences, supporting multi-turn dialogues, and adapting to real-world dynamic scenarios.
We propose StreamChat, a training-free framework for streaming video reasoning and conversational interaction.
Our framework incorporates a parallel system scheduling strategy that enhances processing speed and reduces latency, ensuring robust performance in real-world applications.
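One way to picture a training-free streaming setup with memory and parallel scheduling, stated here as an assumption: an ingestion thread keeps writing frames into a bounded short-term memory while older content is compressed into a long-term store, and a dialogue turn can read both at any time. The `StreamMemory` class below is a hypothetical sketch, not StreamChat's implementation.

```python
import collections
import threading

# Bounded short-term frame memory plus a compressed long-term summary,
# shared between an ingestion thread and a dialogue thread.
class StreamMemory:
    def __init__(self, short_capacity=64):
        self.short_term = collections.deque(maxlen=short_capacity)
        self.long_term = []              # coarse summaries of evicted frames
        self.lock = threading.Lock()

    def add_frame(self, frame_feat):
        with self.lock:
            if len(self.short_term) == self.short_term.maxlen:
                # compress the frame about to be evicted into the long-term store
                self.long_term.append(("summary", self.short_term[0]))
            self.short_term.append(frame_feat)

    def retrieve(self):
        """Context for one dialogue turn: long-term summaries + recent frames."""
        with self.lock:
            return list(self.long_term), list(self.short_term)

memory = StreamMemory()
ingest = threading.Thread(
    target=lambda: [memory.add_frame(f"frame-{i}") for i in range(500)])
ingest.start()
long_ctx, short_ctx = memory.retrieve()   # a dialogue turn can run in parallel
ingest.join()
```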
arXiv Detail & Related papers (2025-01-23T08:33:10Z) - InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions [104.90258030688256]
This project introduces disentangled streaming perception, reasoning, and memory mechanisms, enabling real-time interaction with streaming video and audio input. It simulates human-like cognition, enabling multimodal large language models to provide continuous and adaptive service over time.
arXiv Detail & Related papers (2024-12-12T18:58:30Z) - StreamChat: Chatting with Streaming Video [85.02875830683637]
StreamChat is a novel approach that enhances the interaction capabilities of Large Multimodal Models with streaming video content. We introduce a flexible and efficient cross-attention-based architecture to process dynamic streaming inputs. We construct a new dense instruction dataset to facilitate the training of streaming interaction models.
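A cross-attention-based architecture for streaming inputs can be sketched as text queries attending to a visual key/value buffer that simply grows as new frames arrive. The PyTorch snippet below is an assumed illustration of that pattern, not the paper's model; dimensions and chunk sizes are arbitrary.

```python
import torch
import torch.nn as nn

# Text tokens attend to whatever visual tokens have arrived so far; new
# frames only extend the key/value buffer instead of reprocessing old ones.
dim, num_heads = 256, 4
cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

visual_buffer = torch.empty(1, 0, dim)        # grows as the stream advances
text_queries = torch.randn(1, 8, dim)         # stand-in for text hidden states

for step in range(5):                          # five chunks of streaming video
    new_frames = torch.randn(1, 16, dim)       # 16 new visual tokens per chunk
    visual_buffer = torch.cat([visual_buffer, new_frames], dim=1)
    fused, _ = cross_attn(text_queries, visual_buffer, visual_buffer)
    print(step, fused.shape)                   # (1, 8, 256) at every step
```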
arXiv Detail & Related papers (2024-12-11T18:59:54Z) - InterFormer: Real-time Interactive Image Segmentation [80.45763765116175]
Interactive image segmentation enables annotators to efficiently perform pixel-level annotation for segmentation tasks.
The existing interactive segmentation pipeline suffers from inefficient computations of interactive models.
We propose a method named InterFormer that follows a new pipeline to address these issues.
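A common way to remove the per-interaction cost, and one reasonable reading of the pipeline change (stated here as an assumption rather than the paper's exact design), is to run the heavy image encoder once and answer each user click with only a lightweight head. `heavy_backbone` and `light_click_head` below are hypothetical stand-ins.

```python
import numpy as np

def heavy_backbone(image):
    """Expensive encoder; runs once per image."""
    return image.mean(axis=-1, keepdims=True) * np.ones((1, 1, 64))  # HxWx64 toy features

def light_click_head(features, clicks):
    """Cheap head; runs once per interaction, refining the mask from clicks."""
    h, w, _ = features.shape
    mask = np.zeros((h, w), dtype=bool)
    for (y, x), positive in clicks:
        mask[max(0, y - 5):y + 5, max(0, x - 5):x + 5] = positive
    return mask

image = np.random.rand(128, 128, 3)
features = heavy_backbone(image)               # computed a single time
clicks = []
for click in [((40, 40), True), ((90, 90), True), ((60, 60), False)]:
    clicks.append(click)
    mask = light_click_head(features, clicks)  # only this re-runs per click
    print(len(clicks), int(mask.sum()))
```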
arXiv Detail & Related papers (2023-04-06T08:57:00Z) - AEI: Actors-Environment Interaction with Adaptive Attention for Temporal Action Proposals Generation [15.360689782405057]
We propose the Actors-Environment Interaction (AEI) network to improve the video representation for temporal action proposal generation.
AEI contains two modules, i.e., a perception-based visual representation (PVR) module and a boundary-matching module (BMM).
arXiv Detail & Related papers (2021-10-21T20:43:42Z) - Frame-wise Cross-modal Matching for Video Moment Retrieval [32.68921139236391]
Video moment retrieval aims to retrieve a moment in a video for a given language query.
The challenges of this task include 1) localizing the relevant moment in an untrimmed video and 2) bridging the semantic gap between the textual query and the video content.
We propose an Attentive Cross-modal Relevance Matching model that predicts temporal boundaries based on frame-wise interaction modeling.
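Frame-wise cross-modal matching can be illustrated, under assumptions, by scoring every frame against a sentence embedding and reading the temporal boundaries off the per-frame relevance curve; the real model learns the matching and boundary prediction end to end. Everything below is a toy stand-in.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 512))          # untrimmed video, 200 frame features
query = rng.normal(size=512)                  # sentence embedding for the query
frames[80:120] += 0.5 * query                 # make one segment relevant (toy)

relevance = np.array([cosine(f, query) for f in frames])   # frame-wise matching
keep = relevance > relevance.mean() + relevance.std()      # simple thresholding
start, end = np.flatnonzero(keep)[[0, -1]]                  # predicted boundaries
print(f"retrieved moment: frames {start}-{end}")
```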
arXiv Detail & Related papers (2020-09-22T10:25:41Z) - Spatio-Temporal Action Detection with Multi-Object Interaction [127.85524354900494]
In this paper, we study the spatio-temporal action detection problem with multi-object interaction.
We introduce a new dataset that is spatially annotated with action tubes containing multi-object interactions.
We propose an end-to-end spatio-temporal action detection model that performs both spatial and temporal regression simultaneously.
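A model that regresses in space and time simultaneously can be sketched as one trunk feeding two heads: per-frame box regression and per-tube temporal-extent regression. The PyTorch module below is an assumed illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SpatioTemporalHead(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        self.trunk = nn.Linear(feat_dim, 128)
        self.box_head = nn.Linear(128, 4)    # per-frame box: x, y, w, h
        self.time_head = nn.Linear(128, 2)   # per-tube extent: start, end offsets

    def forward(self, tube_feats):           # (batch, frames, feat_dim)
        h = torch.relu(self.trunk(tube_feats))
        boxes = self.box_head(h)             # spatial regression, every frame
        extent = self.time_head(h.mean(dim=1))  # temporal regression, per tube
        return boxes, extent

model = SpatioTemporalHead()
boxes, extent = model(torch.randn(2, 32, 256))   # 2 action tubes, 32 frames each
print(boxes.shape, extent.shape)                 # (2, 32, 4), (2, 2)
```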
arXiv Detail & Related papers (2020-04-01T00:54:56Z) - Collaborative Motion Prediction via Neural Motion Message Passing [37.72454920355321]
We propose neural motion message passing (NMMP) to explicitly model the interaction and learn representations for directed interactions between actors.
Based on the proposed NMMP, we design the motion prediction systems for two settings: the pedestrian setting and the joint pedestrian and vehicle setting.
Both systems outperform the previous state-of-the-art methods on several existing benchmarks.
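The message-passing idea can be sketched as follows, with the learned networks replaced by toy linear maps: every ordered pair of actors exchanges a directed message, and each actor aggregates its incoming messages to update the embedding used for its motion prediction. Names and dimensions below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
num_actors, dim = 5, 32
node = rng.normal(size=(num_actors, dim))          # per-actor motion embeddings
w_msg = rng.normal(scale=0.1, size=(dim, 2 * dim)) # toy message function
w_upd = rng.normal(scale=0.1, size=(dim, 2 * dim)) # toy node update function

for _ in range(3):                                  # a few message-passing rounds
    incoming = np.zeros_like(node)
    for i in range(num_actors):
        for j in range(num_actors):
            if i == j:
                continue
            # directed message from actor j to actor i
            msg = np.tanh(w_msg @ np.concatenate([node[j], node[i]]))
            incoming[i] += msg
    # update every actor from its own state plus aggregated incoming messages
    node = np.tanh(w_upd @ np.concatenate([node, incoming], axis=1).T).T

print(node.shape)   # (5, 32): embeddings fed to the downstream motion predictor
```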
arXiv Detail & Related papers (2020-03-14T10:12:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.