ARGaze: Autoregressive Transformers for Online Egocentric Gaze Estimation
- URL: http://arxiv.org/abs/2602.05132v1
- Date: Wed, 04 Feb 2026 23:33:16 GMT
- Title: ARGaze: Autoregressive Transformers for Online Egocentric Gaze Estimation
- Authors: Jia Li, Wenjie Zhao, Shijian Deng, Bolin Lai, Yuheng Wu, Ruijia Chen, Jon E. Froehlich, Yuhang Zhao, Yapeng Tian
- Abstract summary: Online egocentric gaze estimation predicts where a camera wearer is looking from first-person video using only past and current frames. We propose ARGaze, which reformulates gaze estimation as sequential prediction. We achieve state-of-the-art performance across multiple egocentric benchmarks under online evaluation.
- Score: 46.30718574969354
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Online egocentric gaze estimation predicts where a camera wearer is looking from first-person video using only past and current frames, a task essential for augmented reality and assistive technologies. Unlike third-person gaze estimation, this setting lacks explicit head or eye signals, requiring models to infer current visual attention from sparse, indirect cues such as hand-object interactions and salient scene content. We observe that gaze exhibits strong temporal continuity during goal-directed activities: knowing where a person looked recently provides a powerful prior for predicting where they look next. Inspired by vision-conditioned autoregressive decoding in vision-language models, we propose ARGaze, which reformulates gaze estimation as sequential prediction: at each timestep, a transformer decoder predicts current gaze by conditioning on (i) current visual features and (ii) a fixed-length Gaze Context Window of recent gaze target estimates. This design enforces causality and enables bounded-resource streaming inference. We achieve state-of-the-art performance across multiple egocentric benchmarks under online evaluation, with extensive ablations validating that autoregressive modeling with bounded gaze history is critical for robust prediction. We will release our source code and pre-trained models.
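The core mechanism described in the abstract lends itself to a compact illustration. Below is a minimal PyTorch sketch of autoregressive gaze decoding with a bounded Gaze Context Window, based only on the abstract's description; the module names, feature shapes, and 2D gaze-point parameterization are assumptions, not the authors' released code.

```python
# Minimal sketch of autoregressive gaze decoding with a bounded
# Gaze Context Window (illustrative; not the ARGaze implementation).
import torch
import torch.nn as nn
from collections import deque

class GazeDecoder(nn.Module):
    """Transformer decoder predicting the current 2D gaze point from
    current-frame features plus a window of K recent gaze estimates."""
    def __init__(self, dim=256, window=8, heads=8, layers=4):
        super().__init__()
        self.gaze_embed = nn.Linear(2, dim)           # embed past (x, y) estimates
        layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, layers)
        self.head = nn.Linear(dim, 2)                 # regress normalized (x, y)
        self.window = window

    def forward(self, frame_tokens, gaze_history):
        # frame_tokens: (B, N, dim) visual features of the *current* frame
        # gaze_history: (B, K, 2) most recent gaze estimates, K <= window
        # (a positional embedding over the history would be added in practice)
        query = self.gaze_embed(gaze_history)         # (B, K, dim)
        out = self.decoder(query, frame_tokens)       # cross-attend to visuals
        return torch.sigmoid(self.head(out[:, -1]))   # gaze for the current frame

# Streaming inference with bounded memory: only the last K estimates are kept,
# so cost per frame is constant and only past information is ever used.
decoder = GazeDecoder()
history = deque([torch.zeros(1, 2)] * 8, maxlen=8)    # fixed-length context window
for t in range(100):                                  # stand-in for a video stream
    frame_tokens = torch.randn(1, 196, 256)           # stand-in for encoder output
    gaze = decoder(frame_tokens, torch.stack(list(history), dim=1))
    history.append(gaze.detach())                     # causal: feed back estimates
```

Note how the `deque` with `maxlen` realizes the fixed-length context window: appending the newest estimate silently evicts the oldest, which is what bounds memory during streaming inference.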
Related papers
- Learning Spatio-Temporal Feature Representations for Video-Based Gaze Estimation [50.05866669110754]
Video-based gaze estimation methods aim to capture the inherently temporal dynamics of human eye gaze from multiple image frames.
We propose the Spatio-Temporal Gaze Network (ST-Gaze), a model that combines a CNN backbone with dedicated channel attention and self-attention modules.
We show that ST-Gaze achieves state-of-the-art performance both with and without person-specific adaptation.
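As a rough illustration of the design named above, the sketch below reweights CNN feature channels with a squeeze-and-excitation-style gate and then applies self-attention across frames; all module names and shapes are assumptions, not the ST-Gaze implementation.

```python
# Illustrative sketch: CNN features reweighted by channel attention,
# then fused across frames with self-attention (not the ST-Gaze code).
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style reweighting of feature channels."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                        # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))          # global average pool -> (B, C)
        return x * w[:, :, None, None]           # rescale each channel

class TemporalSelfAttention(nn.Module):
    """Self-attention over per-frame feature vectors of a clip."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats):                    # feats: (B, T, dim)
        out, _ = self.attn(feats, feats, feats)
        return out

# Usage: pool channel-attended CNN maps per frame, then attend over time.
ca, ta = ChannelAttention(512), TemporalSelfAttention(512)
clip = torch.randn(2, 7, 512, 7, 7)              # (B, T, C, H, W) toy features
per_frame = torch.stack([ca(clip[:, t]).mean(dim=(2, 3)) for t in range(7)], dim=1)
fused = ta(per_frame)                            # (B, T, 512) temporal features
```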
arXiv Detail & Related papers (2025-12-19T15:15:58Z)
- StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos [128.45606644157]
StreamGaze is the first benchmark to evaluate how effectively MLLMs use gaze for temporal and proactive reasoning in streaming videos.
We develop a gaze-video QA generation pipeline that aligns egocentric videos with raw gaze trajectories.
We observe substantial performance gaps between state-of-the-art MLLMs and human performance.
arXiv Detail & Related papers (2025-12-01T14:15:44Z)
- Gaze-VLM: Bridging Gaze and VLMs through Attention Regularization for Egocentric Understanding [7.281396624646809]
Eye gaze offers valuable cues about attention, short-term intent, and future actions.
We propose a gaze-regularized framework that enhances VLMs for two key egocentric understanding tasks.
We introduce a gaze-regularized attention mechanism that aligns model focus with human visual gaze.
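One plausible form of such a regularizer, sketched below, is a KL-divergence penalty between the model's spatial attention map and a human gaze heatmap; the exact formulation used by Gaze-VLM is not given here, so treat this as an assumed variant.

```python
# Illustrative gaze-regularization term: penalize divergence between a
# model attention map and a human gaze heatmap (assumed formulation).
import torch

def gaze_attention_loss(attn_map, gaze_heatmap, eps=1e-8):
    """KL(gaze || attention) over spatial locations.

    attn_map:     (B, H, W) attention weights from the model (non-negative)
    gaze_heatmap: (B, H, W) gaze fixation density (non-negative)
    """
    b = attn_map.size(0)
    p = gaze_heatmap.view(b, -1)
    p = p / (p.sum(dim=1, keepdim=True) + eps)        # normalize to distributions
    q = attn_map.view(b, -1)
    q = q / (q.sum(dim=1, keepdim=True) + eps)
    return (p * (torch.log(p + eps) - torch.log(q + eps))).sum(dim=1).mean()

# Typically added to the task loss with a small weight, e.g.:
# loss = task_loss + 0.1 * gaze_attention_loss(attn, heatmap)
```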
arXiv Detail & Related papers (2025-10-24T11:33:03Z)
- EgoTraj-Bench: Towards Robust Trajectory Prediction Under Ego-view Noisy Observations [28.981146701183448]
We introduce EgoTraj-Bench, the first real-world benchmark that grounds noisy, first-person visual histories in clean, bird's-eye-view future trajectories.
We propose BiFlow, a dual-stream flow matching model that concurrently denoises historical observations and forecasts future motion.
BiFlow achieves state-of-the-art performance, reducing minADE and minFDE by 10-15% on average and demonstrating superior robustness.
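For readers unfamiliar with flow matching, the sketch below shows a generic conditional flow-matching training step on flattened trajectories; it does not reproduce the dual-stream BiFlow architecture, and all names and dimensions are illustrative.

```python
# Minimal conditional flow-matching training step for trajectories
# (a generic sketch, not the BiFlow architecture).
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Predicts the velocity field v(x_t, t, cond) for trajectory points."""
    def __init__(self, horizon=12, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(horizon * 2 + 1 + dim, 256), nn.ReLU(),
            nn.Linear(256, horizon * 2))

    def forward(self, x_t, t, cond):              # x_t: (B, horizon*2) flat traj
        return self.net(torch.cat([x_t, t, cond], dim=1))

model = VelocityNet()
x1 = torch.randn(32, 24)                          # clean future trajectory (target)
cond = torch.randn(32, 128)                       # encoding of the (noisy) history
x0 = torch.randn_like(x1)                         # noise sample
t = torch.rand(32, 1)                             # random time in [0, 1]
x_t = (1 - t) * x0 + t * x1                       # linear interpolation path
v_target = x1 - x0                                # constant target velocity
loss = ((model(x_t, t, cond) - v_target) ** 2).mean()
loss.backward()
```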
arXiv Detail & Related papers (2025-10-01T01:30:13Z)
- Ego-centric Predictive Model Conditioned on Hand Trajectories [52.531681772560724]
In egocentric scenarios, anticipating both the next action and its visual outcome is essential for understanding human-object interactions.
We propose a unified two-stage predictive framework that jointly models action and visual future in egocentric scenarios.
Our approach is the first unified model designed to handle both egocentric human activity understanding and robotic manipulation tasks.
arXiv Detail & Related papers (2025-08-27T13:09:55Z)
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- Unifying Top-down and Bottom-up Scanpath Prediction Using Transformers [40.27531644565077]
We propose the Human Attention Transformer (HAT), a single model that predicts both forms of attention control.
HAT sets a new standard in computational attention, which emphasizes effectiveness, generality, and interpretability.
arXiv Detail & Related papers (2023-03-16T15:13:09Z)
- GIMO: Gaze-Informed Human Motion Prediction in Context [75.52839760700833]
We propose a large-scale human motion dataset that delivers high-quality body pose sequences, scene scans, and ego-centric views with eye gaze.
Our data collection is not tied to specific scenes, which further boosts the motion dynamics observed from our subjects.
To realize the full potential of gaze, we propose a novel network architecture that enables bidirectional communication between the gaze and motion branches.
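A common way to realize such bidirectional communication is paired cross-attention, where each branch queries the other; the sketch below is an assumed, generic realization rather than the GIMO architecture.

```python
# Sketch of bidirectional cross-attention between gaze and motion
# feature streams (illustrative; not the GIMO implementation).
import torch
import torch.nn as nn

class BidirectionalFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.gaze_from_motion = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.motion_from_gaze = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, gaze_feats, motion_feats):
        # Each branch queries the other, so information flows both ways.
        g, _ = self.gaze_from_motion(gaze_feats, motion_feats, motion_feats)
        m, _ = self.motion_from_gaze(motion_feats, gaze_feats, gaze_feats)
        return gaze_feats + g, motion_feats + m   # residual fusion

fusion = BidirectionalFusion()
gaze = torch.randn(2, 30, 256)                    # (B, T, dim) gaze features
motion = torch.randn(2, 30, 256)                  # (B, T, dim) pose features
gaze_out, motion_out = fusion(gaze, motion)
```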
arXiv Detail & Related papers (2022-04-20T13:17:39Z)
- Unsupervised Gaze Prediction in Egocentric Videos by Energy-based Surprise Modeling [6.294759639481189]
Egocentric perception has grown rapidly with the advent of immersive computing devices.
Human gaze prediction is an important problem in analyzing egocentric videos.
We quantitatively analyze the generalization capabilities of supervised, deep learning models on the egocentric gaze prediction task.
arXiv Detail & Related papers (2020-01-30T21:52:38Z)