StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos
- URL: http://arxiv.org/abs/2512.01707v1
- Date: Mon, 01 Dec 2025 14:15:44 GMT
- Title: StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos
- Authors: Daeun Lee, Subhojyoti Mukherjee, Branislav Kveton, Ryan A. Rossi, Viet Dac Lai, Seunghyun Yoon, Trung Bui, Franck Dernoncourt, Mohit Bansal
- Abstract summary: StreamGaze is the first benchmark to evaluate how effectively MLLMs use gaze for temporal and proactive reasoning in streaming videos. We develop a gaze-video QA generation pipeline that aligns egocentric videos with raw gaze trajectories. We observe substantial performance gaps between state-of-the-art MLLMs and human performance.
- Score: 128.45606644157
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Streaming video understanding requires models not only to process frames as they arrive, but also to anticipate user intention for realistic applications like AR glasses. While prior streaming benchmarks evaluate temporal reasoning, none measure whether MLLMs can interpret or leverage human gaze signals within a streaming setting. To fill this gap, we introduce StreamGaze, the first benchmark designed to evaluate how effectively MLLMs use gaze for temporal and proactive reasoning in streaming videos. StreamGaze introduces gaze-guided past, present, and proactive tasks that comprehensively evaluate streaming video understanding. These tasks assess whether models can use real-time gaze to follow shifting attention and infer user intentions from only past and currently observed frames. To build StreamGaze, we develop a gaze-video QA generation pipeline that aligns egocentric videos with raw gaze trajectories via fixation extraction, region-specific visual prompting, and scanpath construction. This pipeline produces spatio-temporally grounded QA pairs that closely reflect human perceptual dynamics. Across all StreamGaze tasks, we observe substantial performance gaps between state-of-the-art MLLMs and human performance, revealing fundamental limitations in gaze-based temporal reasoning, intention modeling, and proactive prediction. We further provide detailed analyses of gaze-prompting strategies, reasoning behaviors, and task-specific failure modes, offering deeper insight into why current MLLMs struggle and what capabilities future models must develop. All data and code will be publicly released to support continued research in gaze-guided streaming video understanding.
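The pipeline's fixation-extraction stage is only named, not specified, in the abstract; a common implementation is dispersion-based thresholding (I-DT). A minimal sketch under that assumption (the I-DT choice and the thresholds below are illustrative, not StreamGaze's documented settings):

```python
# Minimal sketch of dispersion-based fixation extraction (I-DT).
# The paper does not specify its algorithm; I-DT is a common choice,
# and the thresholds here are illustrative defaults only.

def extract_fixations(gaze, max_dispersion=0.05, min_duration=0.1):
    """gaze: list of (t, x, y) samples with normalized coordinates.
    Returns (t_start, t_end, cx, cy) fixation tuples."""
    fixations, window = [], []
    for sample in gaze:
        window.append(sample)
        xs = [x for _, x, _ in window]
        ys = [y for _, _, y in window]
        dispersion = (max(xs) - min(xs)) + (max(ys) - min(ys))
        if dispersion > max_dispersion:
            # Window no longer spatially compact: emit a fixation
            # from the preceding samples if they lasted long enough.
            done = window[:-1]
            if done and done[-1][0] - done[0][0] >= min_duration:
                cx = sum(x for _, x, _ in done) / len(done)
                cy = sum(y for _, _, y in done) / len(done)
                fixations.append((done[0][0], done[-1][0], cx, cy))
            window = [sample]
    return fixations
```

The resulting fixations would then feed the region-specific visual prompting and scanpath construction stages the abstract describes.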
Related papers
- ARGaze: Autoregressive Transformers for Online Egocentric Gaze Estimation [46.30718574969354]
Egocentric gaze estimation predicts where a camera wearer is looking from first-person video using only past and current frames. We propose ARGaze, which reformulates gaze estimation as sequential prediction. We achieve state-of-the-art performance across multiple egocentric benchmarks under online evaluation.
arXiv Detail & Related papers (2026-02-04T23:33:16Z)
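The summary frames gaze estimation as sequential (autoregressive) prediction over past and current frames; a minimal sketch of what such a causal predictor could look like (module names and sizes are hypothetical, not the paper's architecture):

```python
import torch
import torch.nn as nn

# Hedged sketch: an autoregressive head over per-frame features,
# predicting the next 2-D gaze point from past frames and past gaze.
# Layer sizes and the upstream feature extractor are illustrative.

class ARGazeSketch(nn.Module):
    def __init__(self, feat_dim=256, n_layers=4, n_heads=4):
        super().__init__()
        self.gaze_embed = nn.Linear(2, feat_dim)  # embed past (x, y)
        layer = nn.TransformerEncoderLayer(feat_dim, n_heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(feat_dim, 2)        # next gaze point

    def forward(self, frame_feats, past_gaze):
        # frame_feats: (B, T, feat_dim); past_gaze: (B, T, 2)
        tokens = frame_feats + self.gaze_embed(past_gaze)
        T = tokens.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.decoder(tokens, mask=causal)  # online: no future access
        return self.head(h[:, -1])             # gaze at the next step
```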
- Learning Spatio-Temporal Feature Representations for Video-Based Gaze Estimation [50.05866669110754]
Video-based gaze estimation methods aim to capture the inherently temporal dynamics of human eye gaze from multiple image frames. We propose the Spatio-Temporal Gaze Network (ST-Gaze), a model that combines a CNN backbone with dedicated channel attention and self-attention modules. We show that ST-Gaze achieves state-of-the-art performance both with and without person-specific adaptation.
arXiv Detail & Related papers (2025-12-19T15:15:58Z)
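A minimal sketch of the recipe this summary names, per-frame CNN features passed through channel attention and then temporal self-attention; a squeeze-and-excitation-style channel module is assumed, and all dimensions are illustrative:

```python
import torch
import torch.nn as nn

# Hedged sketch, not the paper's architecture: a stand-in CNN,
# squeeze-and-excitation channel attention, temporal self-attention.

class ChannelAttention(nn.Module):
    def __init__(self, c, r=8):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(c, c // r), nn.ReLU(),
                                nn.Linear(c // r, c), nn.Sigmoid())

    def forward(self, x):                 # x: (B*T, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))   # squeeze spatial dims
        return x * w[:, :, None, None]    # reweight channels

class STGazeSketch(nn.Module):
    def __init__(self, c=128):
        super().__init__()
        self.backbone = nn.Conv2d(3, c, 7, stride=4, padding=3)  # stand-in
        self.ca = ChannelAttention(c)
        self.temporal = nn.MultiheadAttention(c, 4, batch_first=True)
        self.head = nn.Linear(c, 2)       # gaze direction (pitch, yaw)

    def forward(self, clips):             # clips: (B, T, 3, H, W)
        B, T = clips.shape[:2]
        f = self.ca(self.backbone(clips.flatten(0, 1)))
        f = f.mean(dim=(2, 3)).view(B, T, -1)  # (B, T, C)
        f, _ = self.temporal(f, f, f)          # attend across frames
        return self.head(f[:, -1])
```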
- StreamEQA: Towards Streaming Video Understanding for Embodied Scenarios [33.70462645363648]
StreamEQA is the first benchmark for streaming video question answering in embodied scenarios. It is built upon 156 independent long videos and generates approximately 21K question-answer pairs with precise timestamps. We hope StreamEQA will catalyze research on streaming video understanding for embodied applications.
arXiv Detail & Related papers (2025-12-04T04:48:16Z)
- Gaze-VLM: Bridging Gaze and VLMs through Attention Regularization for Egocentric Understanding [7.281396624646809]
Eye gaze offers valuable cues about attention, short-term intent, and future actions. We propose a gaze-regularized framework that enhances VLMs for two key egocentric understanding tasks. We introduce a gaze-regularized attention mechanism that aligns model focus with human visual gaze.
arXiv Detail & Related papers (2025-10-24T11:33:03Z)
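One plausible reading of the gaze-regularized attention mechanism is a divergence penalty between the model's attention over image patches and a human gaze heatmap; a hedged sketch (the paper's actual formulation is not given in this summary):

```python
import torch
import torch.nn.functional as F

# Hedged sketch of a gaze-regularization term: KL divergence between
# the model's patch attention and a human gaze heatmap over the same
# patches. Gaze-VLM's exact loss may differ; this is illustrative.

def gaze_attention_loss(attn, gaze_heatmap, eps=1e-8):
    """attn: (B, N) attention over N patches, rows sum to 1.
    gaze_heatmap: (B, N) gaze density over the same N patches."""
    gaze = gaze_heatmap / (gaze_heatmap.sum(-1, keepdim=True) + eps)
    return F.kl_div((attn + eps).log(), gaze, reduction="batchmean")

# Training would combine this with the task loss, e.g.:
# loss = task_loss + lambda_gaze * gaze_attention_loss(attn, heatmap)
```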
- Harnessing Synthetic Preference Data for Enhancing Temporal Understanding of Video-LLMs [54.502280390499756]
We propose TimeWarp, which creates a targeted synthetic temporal dataset and fine-tunes the model on it, encouraging responses that focus on the given input video. We demonstrate that when our method is applied to existing models, it significantly improves performance on temporal understanding benchmarks.
arXiv Detail & Related papers (2025-10-04T21:48:40Z)
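The summary does not detail how the synthetic preference data is built; one plausible construction contrasts an answer grounded in the true frame order against one describing a shuffled order. A hedged sketch, not TimeWarp's actual procedure (assumes at least two distinct events per video):

```python
import random

# Hedged sketch of synthetic temporal preference pairs: the preferred
# answer matches the true event order, the rejected one describes a
# shuffled order. TimeWarp's actual construction may differ.

def make_preference_pair(events):
    """events: list of (timestamp, caption) in true temporal order;
    assumes at least two distinct events so shuffling can differ."""
    chosen = " Then ".join(cap for _, cap in events)
    shuffled = events[:]
    while shuffled == events:          # ensure the order actually changes
        random.shuffle(shuffled)
    rejected = " Then ".join(cap for _, cap in shuffled)
    prompt = "Describe the events in this video in order."
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```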
- In the Eye of MLLM: Benchmarking Egocentric Video Intent Understanding with Gaze-Guided Prompting [12.567763863700058]
EgoGazeVQA is an egocentric gaze-guided video question answering benchmark. Our experiments reveal that existing MLLMs struggle to accurately interpret user intentions. Our gaze-guided intent prompting methods significantly enhance performance.
arXiv Detail & Related papers (2025-09-09T07:11:56Z)
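A minimal sketch of one gaze-guided prompting strategy consistent with this summary: serializing recent fixations into the text prompt alongside sampled frames. The actual EgoGazeVQA templates may differ:

```python
# Hedged sketch of gaze-guided intent prompting: recent fixations are
# textualized and prepended to the question. Illustrative wording only.

def build_gaze_prompt(question, fixations):
    """fixations: list of (t, x, y) with normalized coordinates."""
    lines = [f"- at {t:.1f}s the user looked at ({x:.2f}, {y:.2f})"
             for t, x, y in fixations]
    return ("You see an egocentric video. Recent gaze fixations:\n"
            + "\n".join(lines)
            + f"\nQuestion: {question}"
            + "\nAnswer, taking into account where the user looked.")
```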
- StreamAgent: Towards Anticipatory Agents for Streaming Video Understanding [52.55809460075286]
We propose StreamAgent, which anticipates the temporal intervals and spatial regions expected to contain future task-relevant information. We integrate question semantics and historical observations by prompting the anticipatory agent to predict the temporal progression of key events. Our method outperforms existing methods in response accuracy and real-time efficiency, highlighting its practical value for real-world streaming scenarios.
arXiv Detail & Related papers (2025-08-03T18:15:42Z)
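A hedged sketch of the anticipatory prompting this summary describes, asking the agent to forecast when and where task-relevant evidence will appear; the wording is illustrative, not StreamAgent's actual template:

```python
# Hedged sketch: build the anticipatory prompt from the question and
# recent observations. Illustrative phrasing, not the paper's template.

def anticipatory_prompt(question, history):
    """history: list of (timestamp, observation) pairs seen so far."""
    past = "\n".join(f"[{t:.1f}s] {obs}" for t, obs in history[-5:])
    return (f"Task question: {question}\n"
            f"Observations so far:\n{past}\n"
            "Predict: (1) the time interval in which the key event will "
            "occur, and (2) the spatial region to watch, so frames can "
            "be sampled there before the event happens.")
```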
- SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding [56.78088668917983]
We introduce SVBench, a pioneering benchmark with temporal multi-turn question-answering chains. We design a semi-automated annotation pipeline to obtain 49,979 question-answer (QA) pairs from 1,353 streaming videos. Our experimental results, obtained from 14 models in dialogue and streaming evaluations, reveal that while the closed-source GPT-4o outperforms others, most open-source LVLMs struggle with long-context streaming video understanding.
arXiv Detail & Related papers (2025-02-15T14:29:44Z)
- Understanding Long Videos via LLM-Powered Entity Relation Graphs [51.13422967711056]
GraphVideoAgent is a framework that maps and monitors the evolving relationships between visual entities throughout the video sequence. Our approach demonstrates remarkable effectiveness when tested against industry benchmarks.
arXiv Detail & Related papers (2025-01-27T10:57:24Z)
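A minimal sketch of an evolving entity-relation graph of the kind this summary describes, with time-stamped edges that can be queried for how a relationship changed over the video; the structure is an assumption, not taken from the paper:

```python
from collections import defaultdict

# Hedged sketch: nodes are entities, edges carry time-stamped
# relations accumulated as the video plays. Illustrative only.

class EntityGraph:
    def __init__(self):
        self.edges = defaultdict(list)  # (a, b) -> [(t, relation), ...]

    def observe(self, t, a, relation, b):
        """Record that entity `a` relates to `b` at time t (seconds)."""
        self.edges[(a, b)].append((t, relation))

    def history(self, a, b):
        """How the a-b relationship evolved across the video."""
        return sorted(self.edges[(a, b)])

g = EntityGraph()
g.observe(3.0, "person", "picks up", "cup")
g.observe(9.5, "person", "puts down", "cup")
print(g.history("person", "cup"))  # ordered relation timeline
```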
- TPP-Gaze: Modelling Gaze Dynamics in Space and Time with Neural Temporal Point Processes [63.95928298690001]
We present TPP-Gaze, a novel and principled approach to model scanpath dynamics based on Neural Temporal Point Processes (TPPs). Our results show the overall superior performance of the proposed model compared to state-of-the-art approaches.
arXiv Detail & Related papers (2024-10-30T19:22:38Z)
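For reference, the general temporal point process log-likelihood that models of this family fit, written for fixation onset times; TPP-Gaze additionally models fixation locations, and this is the generic form, not necessarily its exact parameterization:

```latex
% Log-likelihood of fixation onset times t_1 < ... < t_n on [0, T],
% with conditional intensity \lambda(t \mid \mathcal{H}_t) given the
% history \mathcal{H}_t. TPP-Gaze's spatial component is omitted here.
\log p(t_1, \ldots, t_n)
  = \sum_{i=1}^{n} \log \lambda\!\left(t_i \mid \mathcal{H}_{t_i}\right)
  - \int_{0}^{T} \lambda\!\left(t \mid \mathcal{H}_t\right) dt
```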