Eyes on Target: Gaze-Aware Object Detection in Egocentric Video
- URL: http://arxiv.org/abs/2511.01237v1
- Date: Mon, 03 Nov 2025 05:21:58 GMT
- Title: Eyes on Target: Gaze-Aware Object Detection in Egocentric Video
- Authors: Vishakha Lall, Yisi Liu
- Abstract summary: We propose Eyes on Target, a novel depth-aware and gaze-guided object detection framework for egocentric videos. Our approach injects gaze-derived features into the attention mechanism of a Vision Transformer (ViT), effectively biasing spatial feature selection toward human-attended regions. We validate our method on an egocentric simulator dataset where human visual attention is critical for task assessment.
- Score: 1.3320917259299652
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human gaze offers rich supervisory signals for understanding visual attention in complex visual environments. In this paper, we propose Eyes on Target, a novel depth-aware and gaze-guided object detection framework designed for egocentric videos. Our approach injects gaze-derived features into the attention mechanism of a Vision Transformer (ViT), effectively biasing spatial feature selection toward human-attended regions. Unlike traditional object detectors that treat all regions equally, our method emphasises viewer-prioritised areas to enhance object detection. We validate our method on an egocentric simulator dataset where human visual attention is critical for task assessment, illustrating its potential in evaluating human performance in simulation scenarios. We evaluate the effectiveness of our gaze-integrated model through extensive experiments and ablation studies, demonstrating consistent gains in detection accuracy over gaze-agnostic baselines on both the custom simulator dataset and public benchmarks, including Ego4D Ego-Motion and Ego-CH-Gaze datasets. To interpret model behaviour, we also introduce a gaze-aware attention head importance metric, revealing how gaze cues modulate transformer attention dynamics.
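The gaze-injection idea lends itself to a compact sketch. The snippet below shows one plausible way to add a gaze prior to ViT self-attention logits and to score how strongly each head follows gaze; the additive-bias formulation, the `beta` weight, and all function names are illustrative assumptions, not the authors' implementation.

```python
import torch

def gaze_biased_attention(q, k, v, gaze_heat, beta=1.0):
    """Scaled dot-product attention with an additive gaze bias.

    q, k, v:   (B, H, N, D) per-head query/key/value tensors over N patches
    gaze_heat: (B, N) gaze saliency per patch in [0, 1], e.g. fixation points
               rasterised onto the ViT patch grid
    beta:      strength of the gaze bias (hypothetical hyperparameter)
    """
    scale = q.shape[-1] ** -0.5
    logits = (q @ k.transpose(-2, -1)) * scale            # (B, H, N, N)
    # Shift every query's logits toward gaze-attended key patches.
    logits = logits + beta * gaze_heat[:, None, None, :]
    attn = logits.softmax(dim=-1)
    return attn @ v, attn

def gaze_head_importance(attn, gaze_heat):
    """One plausible gaze-aware head-importance score: the mean attention
    mass each head places on gaze-attended patches (the paper's exact
    metric may differ)."""
    mass = (attn.mean(dim=2) * gaze_heat[:, None, :]).sum(dim=-1)  # (B, H)
    return mass.mean(dim=0)                                        # (H,)
```

For a standard 224x224 input with 16x16 patches, `gaze_heat` would be a 196-bin heatmap per frame; under this reading, heads whose importance score changes most when the bias is removed are the ones most modulated by gaze.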
Related papers
- Revisiting Salient Object Detection from an Observer-Centric Perspective [48.99721284788945]
We propose Observer-Centric Salient Object Detection (OC-SOD), where salient regions are predicted by considering not only the visual cues but also observer-specific factors such as their preferences or intents. As a result, this formulation captures the intrinsic ambiguity and diversity of human perception, enabling personalized and context-aware saliency prediction.
arXiv Detail & Related papers (2026-02-06T03:53:01Z)
- EscherVerse: An Open World Benchmark and Dataset for Teleo-Spatial Intelligence with Physical-Dynamic and Intent-Driven Understanding [56.89359230139883]
We introduce Teleo-Spatial Intelligence (TSI), a new paradigm that unifies two critical pillars: Physical-Dynamic Reasoning and Intent-Driven Reasoning. We present EscherVerse, consisting of a large-scale, open-world benchmark (Escher-Bench), a dataset (Escher-35k), and models (Escher series). It is the first benchmark to systematically assess Intent-Driven Reasoning, challenging models to connect physical events to their underlying human purposes.
arXiv Detail & Related papers (2026-01-04T14:42:39Z)
- HAGI++: Head-Assisted Gaze Imputation and Generation [19.626054627997778]
We introduce HAGI++, a multi-modal diffusion-based approach for gaze data imputation. It uses the integrated head orientation sensors to exploit the inherent correlation between head and eye movements. Our method paves the way for more complete and accurate eye gaze recordings in real-world settings.
arXiv Detail & Related papers (2025-11-04T10:51:34Z)
- Gaze-VLM: Bridging Gaze and VLMs through Attention Regularization for Egocentric Understanding [7.281396624646809]
Eye gaze offers valuable cues about attention, short-term intent, and future actions. We propose a gaze-regularized framework that enhances VLMs for two key egocentric understanding tasks. We introduce a gaze-regularized attention mechanism that aligns model focus with human visual gaze.
arXiv Detail & Related papers (2025-10-24T11:33:03Z)
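A minimal sketch of such a gaze-regularized attention term, assuming a KL penalty that pulls the model's patch attention toward the human gaze density; the loss form and names are illustrative, not necessarily the paper's exact objective:

```python
import torch

def gaze_attention_kl(attn_logits, gaze_map, eps=1e-8):
    """KL(gaze || attention) over N image patches.

    attn_logits: (B, N) unnormalised model attention scores
    gaze_map:    (B, N) human gaze fixation density over the same patches
    """
    p = gaze_map / (gaze_map.sum(dim=-1, keepdim=True) + eps)  # target dist.
    q = attn_logits.softmax(dim=-1)                            # model dist.
    return (p * ((p + eps).log() - (q + eps).log())).sum(dim=-1).mean()
```

In training, a term like this would typically be added to the task loss with a small weight, e.g. `loss = task_loss + lam * gaze_attention_kl(attn_logits, gaze_map)`.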
- Human Scanpath Prediction in Target-Present Visual Search with Semantic-Foveal Bayesian Attention [49.99728312519117]
SemBA-FAST is a top-down framework designed for predicting human visual attention in target-present visual search. We evaluate SemBA-FAST on the COCO-Search18 benchmark dataset, comparing its performance against other scanpath prediction models. These findings provide valuable insights into the capabilities of semantic-foveal probabilistic frameworks for human-like attention modelling.
arXiv Detail & Related papers (2025-07-24T15:19:23Z)
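As a toy illustration of the semantic-foveal Bayesian idea, the sketch below fuses a belief over target location with evidence gathered at the current fixation and then fixates the posterior mode; this is a didactic sketch, not the SemBA-FAST algorithm.

```python
import numpy as np

def update_belief(prior, likelihood):
    """One Bayesian step: fuse the belief over grid cells with the semantic
    evidence gathered at the current fixation."""
    post = prior * likelihood
    return post / post.sum()

def next_fixation(belief):
    """Greedy policy: fixate the cell where the target is most probable."""
    return np.unravel_index(np.argmax(belief), belief.shape)
```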
- Enhancing Saliency Prediction in Monitoring Tasks: The Role of Visual Highlights [4.0361765428523135]
We develop a new saliency model to infer how visual attention changes in the highlight condition.
Our findings show the effectiveness of visual highlights in enhancing user attention and demonstrate the potential of incorporating these cues into saliency prediction models.
arXiv Detail & Related papers (2024-05-15T20:43:57Z)
- GazeFusion: Saliency-Guided Image Generation [50.37783903347613]
Diffusion models offer unprecedented image generation power given just a text prompt. We present a saliency-guided framework to incorporate the data priors of human visual attention mechanisms into the generation process.
arXiv Detail & Related papers (2024-03-16T21:01:35Z)
- Enhancing HOI Detection with Contextual Cues from Large Vision-Language Models [56.257840490146]
ConCue is a novel approach for improving visual feature extraction in HOI detection.
We develop a transformer-based feature extraction module with a multi-tower architecture that integrates contextual cues into both instance and interaction detectors.
arXiv Detail & Related papers (2023-11-26T09:11:32Z)
- CLERA: A Unified Model for Joint Cognitive Load and Eye Region Analysis in the Wild [18.79132232751083]
Real-time analysis of the dynamics of the eye region allows us to monitor humans' visual attention allocation and estimate their mental state.
We propose CLERA, which achieves precise keypoint detection and temporal tracking in a joint-learning framework.
We also introduce a large-scale dataset of 30k human faces with joint pupil, eye-openness, and landmark annotation.
arXiv Detail & Related papers (2023-06-26T21:20:23Z)
- Top-Down Visual Attention from Analysis by Synthesis [87.47527557366593]
We consider top-down attention from a classic Analysis-by-Synthesis (AbS) perspective of vision.
We propose the Analysis-by-Synthesis Vision Transformer (AbSViT), a top-down modulated ViT model that variationally approximates AbS and achieves controllable top-down attention.
arXiv Detail & Related papers (2023-03-23T05:17:05Z)
- GIMO: Gaze-Informed Human Motion Prediction in Context [75.52839760700833]
We propose a large-scale human motion dataset that delivers high-quality body pose sequences, scene scans, and ego-centric views with eye gaze.
Our data collection is not tied to specific scenes, which further boosts the motion dynamics observed from our subjects.
To realize the full potential of gaze, we propose a novel network architecture that enables bidirectional communication between the gaze and motion branches.
arXiv Detail & Related papers (2022-04-20T13:17:39Z)
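One way to realize such bidirectional communication is symmetric cross-attention between the two token streams; the layer below is a sketch under that assumption (dimensions, residual wiring, and names are hypothetical, not the GIMO architecture).

```python
import torch
import torch.nn as nn

class BidirectionalGazeMotionFusion(nn.Module):
    """Gaze tokens attend to motion tokens and vice versa."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.gaze_to_motion = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.motion_to_gaze = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, gaze_tokens, motion_tokens):
        # Each branch queries the other and keeps a residual connection.
        m, _ = self.gaze_to_motion(motion_tokens, gaze_tokens, gaze_tokens)
        g, _ = self.motion_to_gaze(gaze_tokens, motion_tokens, motion_tokens)
        return gaze_tokens + g, motion_tokens + m
```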
- TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild [77.59069361196404]
TRiPOD is a novel method for predicting body dynamics based on graph attentional networks.
To incorporate a real-world challenge, we learn an indicator representing whether an estimated body joint is visible/invisible at each frame.
Our evaluation shows that TRiPOD outperforms all prior work, including state-of-the-art methods designed specifically for either trajectory or pose forecasting.
arXiv Detail & Related papers (2021-04-08T20:01:00Z)
- Integrating Human Gaze into Attention for Egocentric Activity Recognition [40.517438760096056]
We introduce an effective probabilistic approach to integrate human gaze into temporal attention for egocentric activity recognition.
We represent the locations of gaze fixation points as structured discrete latent variables to model their uncertainties.
The predicted gaze locations are used to provide informative attentional cues to improve the recognition performance.
arXiv Detail & Related papers (2020-11-08T08:02:30Z)
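A minimal sketch of the discrete-latent idea, assuming a Gumbel-softmax relaxation so that uncertainty in predicted fixation locations propagates into the attention weights; the relaxation choice and names are assumptions, not necessarily the paper's inference scheme.

```python
import torch
import torch.nn.functional as F

def sample_gaze_weights(gaze_logits, tau=1.0, hard=False):
    """Sample a (relaxed) one-hot fixation over N spatial cells from the
    predicted logits, keeping the sampling step differentiable."""
    return F.gumbel_softmax(gaze_logits, tau=tau, hard=hard, dim=-1)  # (B, N)

def gaze_pooled_features(features, gaze_logits, tau=1.0):
    """Pool per-cell features (B, N, D) with the sampled fixation weights,
    yielding a gaze-conditioned cue for activity recognition."""
    w = sample_gaze_weights(gaze_logits, tau=tau)   # (B, N)
    return (w.unsqueeze(-1) * features).sum(dim=1)  # (B, D)
```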
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.