StreamEQA: Towards Streaming Video Understanding for Embodied Scenarios
- URL: http://arxiv.org/abs/2512.04451v1
- Date: Thu, 04 Dec 2025 04:48:16 GMT
- Title: StreamEQA: Towards Streaming Video Understanding for Embodied Scenarios
- Authors: Yifei Wang, Zhenkai Li, Tianwen Qian, Huanran Zheng, Zheng Wang, Yuqian Fu, Xiaoling Wang
- Abstract summary: StreamEQA is the first benchmark for streaming video question answering in embodied scenarios. It is built upon 156 independent long videos and generates approximately 21K question-answer pairs with precise timestamps. We hope StreamEQA will catalyze research on streaming video understanding for embodied applications.
- Score: 33.70462645363648
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As embodied intelligence advances toward real-world deployment, the ability to continuously perceive and reason over streaming visual inputs becomes essential. In such settings, an agent must maintain situational awareness of its environment, comprehend the interactions with surrounding entities, and dynamically plan actions informed by past observations, current contexts, and anticipated future events. To facilitate progress in this direction, we introduce StreamEQA, the first benchmark designed for streaming video question answering in embodied scenarios. StreamEQA evaluates existing MLLMs along two orthogonal dimensions: Embodied and Streaming. Along the embodied dimension, we categorize the questions into three levels: perception, interaction, and planning, which progressively assess a model's ability to recognize fine-grained visual details, reason about agent-object interactions, and perform high-level goal-directed reasoning. For the streaming dimension, questions are divided into backward, real-time, and forward reasoning, with each mode relying on a distinct temporal context. Built upon 156 independent long videos, StreamEQA defines 42 tasks and generates approximately 21K question-answer pairs with precise timestamps through a hybrid pipeline combining automated generation and human refinement. Evaluations of 13 state-of-the-art video-LLMs reveal that, despite strong performance on conventional benchmarks, these models still struggle with streaming video understanding in embodied scenarios. We hope StreamEQA will catalyze research on streaming video understanding for embodied applications.
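For concreteness, the two-axis taxonomy described in the abstract maps naturally onto a per-question record. The minimal Python sketch below illustrates one such layout; the `StreamEQAItem` class, field names, and example values are illustrative assumptions rather than the released schema. Only the level and mode vocabularies come from the abstract.

```python
from dataclasses import dataclass
from enum import Enum

class EmbodiedLevel(Enum):      # embodied dimension (from the abstract)
    PERCEPTION = "perception"
    INTERACTION = "interaction"
    PLANNING = "planning"

class StreamingMode(Enum):      # streaming dimension (from the abstract)
    BACKWARD = "backward"
    REAL_TIME = "real-time"
    FORWARD = "forward"

@dataclass
class StreamEQAItem:
    video_id: str        # one of the 156 long videos
    timestamp_s: float   # when the question is issued in the stream
    question: str
    answer: str
    level: EmbodiedLevel
    mode: StreamingMode

# At evaluation time, a streaming model may only see frames up to
# timestamp_s; backward questions query past context, real-time
# questions the current moment, and forward questions require
# anticipating events that have not yet been observed.
item = StreamEQAItem(
    video_id="ego_kitchen_042",  # hypothetical ID
    timestamp_s=187.0,
    question="Which object did the agent just place on the counter?",
    answer="a cutting board",
    level=EmbodiedLevel.INTERACTION,
    mode=StreamingMode.BACKWARD,
)
```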
Related papers
- StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos [128.45606644157]
StreamGaze is the first benchmark to evaluate how effectively MLLMs use gaze for temporal and proactive reasoning in streaming videos. We develop a gaze-video QA generation pipeline that aligns egocentric videos with raw gaze trajectories. We observe substantial performance gaps between state-of-the-art MLLMs and human performance.
arXiv Detail & Related papers (2025-12-01T14:15:44Z)
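The StreamGaze abstract does not say how gaze trajectories are aligned with video; a common approach is nearest-timestamp matching between gaze samples and frames. The sketch below illustrates that assumption; the function name and data layout are hypothetical.

```python
import bisect

def align_gaze_to_frames(frame_times, gaze_samples):
    """Assign each frame the gaze sample closest in time.

    frame_times  : sorted list of frame timestamps (seconds)
    gaze_samples : list of (timestamp, x, y) tuples, sorted by timestamp
    Returns one (x, y) gaze point per frame.
    """
    gaze_times = [t for t, _, _ in gaze_samples]
    aligned = []
    for ft in frame_times:
        i = bisect.bisect_left(gaze_times, ft)
        # pick the neighbor (i-1 or i) whose timestamp is closest to the frame
        candidates = [j for j in (i - 1, i) if 0 <= j < len(gaze_samples)]
        j = min(candidates, key=lambda j: abs(gaze_times[j] - ft))
        aligned.append(gaze_samples[j][1:])
    return aligned

frames = [0.0, 0.033, 0.066]
gaze = [(0.01, 512, 300), (0.04, 520, 310), (0.07, 530, 305)]
print(align_gaze_to_frames(frames, gaze))  # [(512, 300), (520, 310), (530, 305)]
```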
- StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA [60.86024022291499]
We introduce StreamingCoT, the first dataset explicitly designed for temporally evolving reasoning in streaming videos. Our framework generates per-second dense descriptions and constructs temporally dependent semantic segments through similarity fusion. This dataset establishes a foundation for advancing research in streaming video understanding, complex temporal reasoning, and multimodal inference.
arXiv Detail & Related papers (2025-10-29T09:47:38Z)
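StreamingCoT's "similarity fusion" is described only at this level of detail. One plausible reading, sketched below, greedily merges consecutive per-second descriptions into one segment while their embeddings remain similar; the threshold, the embedding source, and the greedy policy are all assumptions.

```python
import numpy as np

def fuse_segments(embeddings, threshold=0.85):
    """Greedily merge consecutive per-second clips into segments.

    embeddings : (T, D) array, one embedding per second of video
    Returns a list of (start_sec, end_sec) segments (end exclusive).
    A new segment starts whenever cosine similarity between the
    running segment mean and the next second drops below `threshold`.
    """
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    segments, start = [], 0
    for t in range(1, len(unit)):
        seg_mean = unit[start:t].mean(axis=0)
        seg_mean /= np.linalg.norm(seg_mean)
        if float(seg_mean @ unit[t]) < threshold:
            segments.append((start, t))
            start = t
    segments.append((start, len(unit)))
    return segments
```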
- AHA - Predicting What Matters Next: Online Highlight Detection Without Looking Ahead [4.55107996328448]
Aha is an autoregressive highlight detection framework that predicts the relevance of each video frame to a task described in natural language. Aha achieves state-of-the-art (SOTA) performance on highlight detection benchmarks. We explore Aha's potential for real-world robotics applications given a task-oriented natural language input and a continuous, robot-centric video.
arXiv Detail & Related papers (2025-09-19T21:03:00Z)
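Aha's defining constraint is causality: each frame is scored online, using only the task text and frames already seen. The sketch below shows a minimal causal scoring loop consistent with that constraint; the GRU-based recurrence and all dimensions are assumptions, not Aha's actual architecture.

```python
import torch
import torch.nn as nn

class CausalHighlightScorer(nn.Module):
    """Scores frames one at a time, never looking at future frames."""

    def __init__(self, frame_dim=512, text_dim=512, hidden=256):
        super().__init__()
        self.rnn = nn.GRUCell(frame_dim + text_dim, hidden)
        self.head = nn.Linear(hidden, 1)

    def forward(self, frame_feats, text_feat):
        # frame_feats: (T, frame_dim) features arriving in stream order
        # text_feat:   (text_dim,) embedding of the task description
        h = torch.zeros(1, self.rnn.hidden_size)
        scores = []
        for f in frame_feats:                      # autoregressive over time
            x = torch.cat([f, text_feat]).unsqueeze(0)
            h = self.rnn(x, h)                     # state summarizes the past
            scores.append(torch.sigmoid(self.head(h)).squeeze())
        return torch.stack(scores)                 # per-frame relevance in [0, 1]

scorer = CausalHighlightScorer()
relevance = scorer(torch.randn(30, 512), torch.randn(512))  # 30 streamed frames
```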
- StreamAgent: Towards Anticipatory Agents for Streaming Video Understanding [52.55809460075286]
We propose StreamAgent, an agent that anticipates the temporal intervals and spatial regions expected to contain future task-relevant information. We integrate question semantics and historical observations by prompting the agent to anticipate the temporal progression of key events. Our method outperforms existing methods in response accuracy and real-time efficiency, highlighting its practical value for real-world streaming scenarios.
arXiv Detail & Related papers (2025-08-03T18:15:42Z)
- HumanVideo-MME: Benchmarking MLLMs for Human-Centric Video Understanding [120.84817886550765]
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks involving both images and videos. Existing human-centric benchmarks predominantly emphasize video generation quality and action recognition, while overlooking essential perceptual and cognitive abilities required in human-centered scenarios. We propose a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric video understanding.
arXiv Detail & Related papers (2025-07-07T11:52:24Z)
- ImplicitQA: Going beyond frames towards Implicit Video Reasoning [39.63171940350552]
ImplicitQA is a novel benchmark designed to test VideoQA models on human-like implicit reasoning. It comprises 1K meticulously annotated QA pairs drawn from 1K high-quality creative video clips.
arXiv Detail & Related papers (2025-06-26T19:53:54Z)
- A Challenge to Build Neuro-Symbolic Video Agents [5.243155799248514]
We show how a neuro-symbolic perspective can enhance interpretability, enable structured reasoning, and provide stronger guarantees on system behavior. We present a grand challenge to the research community: developing the next generation of intelligent video agents. By addressing these pillars, we can transition from passive perception to intelligent video agents that reason, predict, and act.
arXiv Detail & Related papers (2025-05-20T02:53:21Z)
- StreamChat: Chatting with Streaming Video [85.02875830683637]
StreamChat is a novel approach that enhances the interaction capabilities of Large Multimodal Models with streaming video content. We introduce a flexible and efficient cross-attention-based architecture to process dynamic streaming inputs. We construct a new dense instruction dataset to facilitate the training of streaming interaction models.
arXiv Detail & Related papers (2024-12-11T18:59:54Z)
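StreamChat's abstract names cross-attention as the mechanism coupling language to streaming visual input. A generic form of that coupling is sketched below; the layer sizes and placement within the model are assumptions.

```python
import torch
import torch.nn as nn

class StreamingCrossAttention(nn.Module):
    """Text tokens (queries) attend over visual tokens from the stream."""

    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, visual_tokens):
        # text_tokens:   (B, L_text, dim) current dialogue state
        # visual_tokens: (B, L_vis, dim) features of frames seen so far;
        # as the stream grows, only visual_tokens changes, so cached text
        # queries can re-attend to fresh frames without reprocessing history.
        attended, _ = self.attn(text_tokens, visual_tokens, visual_tokens)
        return self.norm(text_tokens + attended)

layer = StreamingCrossAttention()
out = layer(torch.randn(1, 16, 768), torch.randn(1, 256, 768))
```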
- Object-Centric Temporal Consistency via Conditional Autoregressive Inductive Biases [69.46487306858789]
Conditional Autoregressive Slot Attention (CA-SA) is a framework that enhances the temporal consistency of extracted object-centric representations in video-centric vision tasks.
We present qualitative and quantitative results showing that our proposed method outperforms the considered baselines on downstream tasks.
arXiv Detail & Related papers (2024-10-21T07:44:44Z)
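CA-SA's core inductive bias is autoregressive conditioning: the slots resolved at frame t-1 initialize the slots at frame t, so each slot tends to keep tracking the same object. The sketch below pairs a bare-bones slot-attention update with that conditioning; the single-iteration update and all dimensions are simplifying assumptions, not CA-SA's actual design.

```python
import torch
import torch.nn as nn

class TinySlotAttention(nn.Module):
    """One simplified slot-attention iteration (after Locatello et al., 2020)."""

    def __init__(self, dim=64):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.update = nn.GRUCell(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, feats, slots):
        # feats: (N, dim) frame features; slots: (S, dim) initial slots
        attn = torch.softmax(self.q(slots) @ self.k(feats).T * self.scale, dim=0)
        attn = attn / attn.sum(dim=1, keepdim=True)   # normalize per slot
        updates = attn @ self.v(feats)                # (S, dim)
        return self.update(updates, slots)

# Autoregressive conditioning: slots from frame t-1 seed frame t,
# so the same slot keeps binding to the same object across the video.
sa, dim, num_slots = TinySlotAttention(), 64, 4
slots = torch.randn(num_slots, dim)                   # learned init in practice
for frame_feats in torch.randn(10, 100, dim):         # 10 frames, 100 tokens each
    slots = sa(frame_feats, slots)
```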