Eyes Wide Open: Ego Proactive Video-LLM for Streaming Video
- URL: http://arxiv.org/abs/2510.14560v1
- Date: Thu, 16 Oct 2025 11:11:13 GMT
- Title: Eyes Wide Open: Ego Proactive Video-LLM for Streaming Video
- Authors: Yulin Zhang, Cheng Shi, Yang Wang, Sibei Yang
- Abstract summary: We focus on the innovative task where, given ego-streaming video input, an assistant proactively answers diverse, evolving questions at the opportune moment. This task embodies three key properties: (1) Proactive Coherence, (2) Just-in-Time Responsiveness, and (3) Synchronized Efficiency. We propose a comprehensive technical pipeline to enable models to tackle this challenging task.
- Score: 36.94345183020698
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Envision an AI capable of functioning in human-like settings, moving beyond mere observation to actively understand, anticipate, and proactively respond to unfolding events. Towards this vision, we focus on the innovative task where, given ego-streaming video input, an assistant proactively answers diverse, evolving questions at the opportune moment, while maintaining synchronized perception and reasoning. This task embodies three key properties: (1) Proactive Coherence, (2) Just-in-Time Responsiveness, and (3) Synchronized Efficiency. To evaluate and address these properties, we first introduce ESTP-Bench (Ego Streaming Proactive Benchmark) alongside the ESTP-F1 metric, a novel framework designed for their rigorous assessment. Second, we propose a comprehensive technical pipeline to enable models to tackle this challenging task. This pipeline comprises: (1) a data engine, (2) a multi-stage training strategy, and (3) a proactive dynamic compression technique. Our proposed model effectively addresses these critical properties while outperforming multiple baselines across diverse online and offline benchmarks. Project Page: https://zhangyl4.github.io/publications/eyes-wide-open/
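The abstract does not spell out how ESTP-F1 is computed; the following is a minimal, hypothetical sketch of an F1-style score over timed proactive answers, assuming each ground-truth question has an ideal response window and each model response carries a timestamp and a correctness flag. The function name, the matching rule, and the tolerance parameter are all assumptions for illustration, not the paper's metric.

```python
# Hypothetical sketch of an F1-style score for timed proactive answers.
# NOT the paper's ESTP-F1 definition; it only illustrates jointly rewarding
# answering correctly AND at the opportune moment.

def timed_f1(gt_windows, responses, tol=0.0):
    """gt_windows: list of (t_start, t_end) ideal answer windows.
    responses: list of (t, correct) model outputs with timestamps."""
    matched = set()
    tp = 0
    for t, correct in responses:
        hit = next((i for i, (s, e) in enumerate(gt_windows)
                    if i not in matched and s - tol <= t <= e + tol), None)
        if hit is not None and correct:
            matched.add(hit)   # each window can be satisfied at most once
            tp += 1
    precision = tp / len(responses) if responses else 0.0
    recall = tp / len(gt_windows) if gt_windows else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Example: two questions; one answered on time and correctly, one too late.
print(timed_f1([(3.0, 5.0), (8.0, 9.5)], [(4.2, True), (11.0, True)]))  # 0.5
```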
Related papers
- Proact-VL: A Proactive VideoLLM for Real-Time AI Companions [52.23988809605433]
We present Proact-VL, a framework that shapes multimodal language models into proactive, real-time interactive agents capable of human-like environment perception and interaction. We instantiate AI companions through two gaming scenarios, commentator and guide, selected for automatic evaluation.
arXiv Detail & Related papers (2026-03-03T19:02:46Z) - StreamEQA: Towards Streaming Video Understanding for Embodied Scenarios [33.70462645363648]
StreamEQA is the first benchmark for streaming video question answering in embodied scenarios. It is built upon 156 independent long videos and generates approximately 21K question-answer pairs with precise timestamps. We hope StreamEQA will catalyze research on streaming video understanding for embodied applications.
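The summary implies each example pairs a question and answer with a precise timestamp in a long source video. A minimal sketch of what such a record might look like follows; the field names are assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass
class StreamingQAItem:
    # Hypothetical record layout for a timestamped streaming QA pair;
    # StreamEQA's actual schema may differ.
    video_id: str        # one of the 156 long source videos
    question: str
    answer: str
    timestamp_s: float   # when the question becomes answerable in the stream

item = StreamingQAItem("vid_0001", "What did I just pick up?", "a mug", 127.4)
print(item)
```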
arXiv Detail & Related papers (2025-12-04T04:48:16Z) - Video-LLMs with Temporal Visual Screening [59.18455762289321]
Temporal Visual Screening (TVS) is a new task that universally pre-processes video question answering and instruction tuning data. TVS is formulated as a modular front-end adapter task that can be seamlessly integrated into both Video Instruction Tuning (training) and Video Question Answering (inference) pipelines. Experiments demonstrate that incorporating TVS yields relative gains of 7.33% (training) and 34.6% (inference).
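The screening mechanism itself is not described here; a minimal sketch of one plausible front-end adapter is shown below, which scores frames against the question and forwards only the most relevant ones to the Video-LLM. The similarity scoring and keep ratio are assumptions, not TVS's actual method.

```python
import numpy as np

def screen_frames(frame_feats, question_feat, keep_ratio=0.25):
    """Hypothetical temporal screening: rank frames by cosine similarity
    to the question embedding and keep the top fraction, in temporal order.
    frame_feats: (T, D) array; question_feat: (D,) array."""
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    q = question_feat / np.linalg.norm(question_feat)
    scores = f @ q                               # (T,) relevance per frame
    k = max(1, int(len(scores) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-k:])      # top-k, restored to time order
    return frame_feats[keep], keep

feats, idx = screen_frames(np.random.randn(64, 512), np.random.randn(512))
print(idx)  # indices of frames forwarded to the Video-LLM
```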
arXiv Detail & Related papers (2025-08-27T14:33:32Z) - DriveTransformer: Unified Transformer for Scalable End-to-End Autonomous Driving [62.62464518137153]
DriveTransformer is a simplified E2E-AD framework for the ease of scaling up. It is composed of three unified operations: task self-attention, sensor cross-attention, and temporal cross-attention. It achieves state-of-the-art performance on both the simulated closed-loop benchmark Bench2Drive and the real-world open-loop benchmark nuScenes with high FPS.
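A minimal PyTorch sketch of one block built from the three operations the summary names is given below. The dimensions, residual layout, and ordering of the three attentions are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class UnifiedBlock(nn.Module):
    """Hypothetical block combining the three attention operations named
    above; the real DriveTransformer layout may differ."""
    def __init__(self, d=256, heads=8):
        super().__init__()
        self.task_self = nn.MultiheadAttention(d, heads, batch_first=True)
        self.sensor_cross = nn.MultiheadAttention(d, heads, batch_first=True)
        self.temporal_cross = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, task_q, sensor_tokens, history_tokens):
        # Task queries attend to each other, then read sensors, then the past.
        x = task_q + self.task_self(task_q, task_q, task_q)[0]
        x = x + self.sensor_cross(x, sensor_tokens, sensor_tokens)[0]
        x = x + self.temporal_cross(x, history_tokens, history_tokens)[0]
        return x + self.ffn(x)

blk = UnifiedBlock()
out = blk(torch.randn(2, 10, 256), torch.randn(2, 100, 256), torch.randn(2, 40, 256))
print(out.shape)  # torch.Size([2, 10, 256])
```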
arXiv Detail & Related papers (2025-03-07T11:41:18Z) - STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training [87.58996020705258]
Video Large Language Models (Video-LLMs) have recently shown strong performance in basic video understanding tasks. However, Video-LLMs struggle with compositional reasoning that requires multi-step explicit spatio-temporal inference across object relations, interactions, and events. We propose STEP, a novel graph-guided self-training method that enables Video-LLMs to generate reasoning-rich finetuning data from any raw videos to improve themselves.
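At a high level, the summary describes a loop that mines a spatio-temporal graph from raw video, turns it into reasoning-rich QA, and finetunes the model on its own generated data. Below is a runnable but heavily stubbed sketch of that loop; every helper is a placeholder standing in for STEP's actual components.

```python
# Hypothetical outline of one graph-guided self-training round; every helper
# below is a trivial stub, not STEP's implementation.

def extract_graph(video_path):
    # Stub: a real extractor would detect objects, relations, and events over time.
    return [("person", "picks up", "cup", 3.2), ("person", "drinks from", "cup", 5.0)]

def generate_qa(graph):
    # Stub: chain two temporally ordered edges into a multi-step question.
    (subj, rel1, obj1, _), (_, rel2, obj2, _) = graph[0], graph[1]
    question = f"After the {subj} {rel1} the {obj1}, what happens next?"
    answer = f"The {subj} {rel2} the {obj2}."
    return [(question, answer)]

def self_training_round(video_paths):
    # Collect reasoning-rich QA pairs; a real round would then finetune the
    # Video-LLM on this self-generated set.
    return [(v, qa) for v in video_paths for qa in generate_qa(extract_graph(v))]

print(self_training_round(["video_001.mp4"]))
```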
arXiv Detail & Related papers (2024-11-29T11:54:55Z) - MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion [118.74385965694694]
We present Motion DUSt3R (MonST3R), a novel geometry-first approach that directly estimates per-timestep geometry from dynamic scenes. By simply estimating a pointmap for each timestep, we can effectively adapt DUSt3R's representation, previously used only for static scenes, to dynamic scenes. We show that by posing the problem as a fine-tuning task, identifying several suitable datasets, and strategically training the model on this limited data, we can surprisingly enable the model to handle dynamics.
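The core per-timestep idea can be sketched in a few lines: run a pointmap predictor on every frame of a dynamic video, so geometry is re-estimated at each timestep rather than assumed static. The predictor below is a zero-filled stand-in, not the DUSt3R/MonST3R network.

```python
import numpy as np

def predict_pointmap(frame):
    # Stand-in for the network: a real model regresses a 3D point per pixel.
    h, w, _ = frame.shape
    return np.zeros((h, w, 3), dtype=np.float32)  # (H, W, 3) pointmap

def per_timestep_geometry(video_frames):
    """Estimate one pointmap per frame, so moving objects are handled by
    simply re-estimating geometry at every timestep."""
    return [predict_pointmap(f) for f in video_frames]

frames = [np.zeros((48, 64, 3), dtype=np.uint8) for _ in range(4)]
pointmaps = per_timestep_geometry(frames)
print(len(pointmaps), pointmaps[0].shape)  # 4 (48, 64, 3)
```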
arXiv Detail & Related papers (2024-10-04T18:00:07Z) - QuIIL at T3 challenge: Towards Automation in Life-Saving Intervention Procedures from First-Person View [2.3982875575861677]
We present our solutions for a spectrum of automation tasks in life-saving intervention procedures within the Trauma THOMPSON (T3) Challenge.
For action recognition and anticipation, we propose a pre-processing strategy that samples and stitches multiple inputs into a single image (sketched after this entry).
For training, we present an action dictionary-guided design, which consistently yields the most favorable results.
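A minimal sketch of the sampling-and-stitching pre-processing mentioned above: uniformly sample N frames from a clip and tile them into one grid image that a single-image backbone can consume. The grid size and uniform sampling are assumptions for illustration.

```python
import numpy as np

def sample_and_stitch(frames, n=4, grid=(2, 2)):
    """Uniformly sample n frames and tile them into one image.
    frames: list of (H, W, 3) arrays of identical size."""
    idx = np.linspace(0, len(frames) - 1, n).round().astype(int)
    picked = [frames[i] for i in idx]
    rows, cols = grid
    row_imgs = [np.hstack(picked[r * cols:(r + 1) * cols]) for r in range(rows)]
    return np.vstack(row_imgs)

clip = [np.full((32, 32, 3), i, dtype=np.uint8) for i in range(30)]
stitched = sample_and_stitch(clip)
print(stitched.shape)  # (64, 64, 3): a 2x2 grid of 32x32 frames
```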
arXiv Detail & Related papers (2024-07-18T06:55:26Z) - Action Recognition with Multi-stream Motion Modeling and Mutual Information Maximization [44.73161606369333]
Action recognition is a fundamental and intriguing problem in artificial intelligence.
We introduce a novel Stream-GCN network equipped with multi-stream components and channel attention (see the sketch below).
Our approach sets the new state-of-the-art performance on three benchmark datasets.
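Channel attention in this setting typically reweights feature channels by their global importance. Below is a minimal squeeze-and-excitation-style PyTorch sketch over skeleton-shaped features; it illustrates the general mechanism, not Stream-GCN's exact module.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style channel reweighting; a generic sketch,
    not the exact module used by Stream-GCN."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                 # x: (batch, channels, time, joints)
        w = self.fc(x.mean(dim=(2, 3)))   # squeeze over time and joints
        return x * w[:, :, None, None]    # excite: per-channel gains

ca = ChannelAttention(64)
print(ca(torch.randn(8, 64, 50, 25)).shape)  # torch.Size([8, 64, 50, 25])
```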
arXiv Detail & Related papers (2023-06-13T06:56:09Z) - Rolling-Unrolling LSTMs for Action Anticipation from First-Person Video [27.391434284586985]
Rolling-Unrolling LSTM is a learning architecture to anticipate actions from egocentric videos (see the sketch below).
The proposed approach is validated on EPIC-Kitchens, EGTEA Gaze+ and ActivityNet.
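The name suggests a recurrent encoder that "rolls" over observed frames and a second recurrent network that "unrolls" into the future to anticipate upcoming actions. A minimal PyTorch sketch of that two-phase idea follows; the hidden sizes, horizon, and feedback scheme are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class RollingUnrolling(nn.Module):
    """Two-phase anticipation sketch: a 'rolling' LSTM summarizes the observed
    egocentric stream, then an 'unrolling' LSTMCell steps into the future and
    emits action logits at each anticipation step. A generic sketch only."""
    def __init__(self, feat_dim=1024, hidden=512, n_actions=100, horizon=4):
        super().__init__()
        self.rolling = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.unrolling = nn.LSTMCell(hidden, hidden)
        self.head = nn.Linear(hidden, n_actions)
        self.horizon = horizon

    def forward(self, obs_feats):                  # (B, T_obs, feat_dim)
        _, (h, c) = self.rolling(obs_feats)        # summarize the past
        h, c = h[0], c[0]                          # (B, hidden)
        preds = []
        for _ in range(self.horizon):              # unroll into the future
            h, c = self.unrolling(h, (h, c))       # feed the state back in
            preds.append(self.head(h))             # action logits per step
        return torch.stack(preds, dim=1)           # (B, horizon, n_actions)

model = RollingUnrolling()
print(model(torch.randn(2, 8, 1024)).shape)  # torch.Size([2, 4, 100])
```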
arXiv Detail & Related papers (2020-05-04T14:13:41Z)