Related papers: GazeDETR: Gaze Detection using Disentangled Head and Gaze Representations

URL: http://arxiv.org/abs/2508.12966v1
Date: Mon, 18 Aug 2025 14:41:18 GMT
Title: GazeDETR: Gaze Detection using Disentangled Head and Gaze Representations
Authors: Ryan Anthony Jalova de Belen, Gelareh Mohammadi, Arcot Sowmya
Abstract summary: We propose GazeDETR, a novel end-to-end architecture with two disentangled decoders. Our proposed architecture achieves state-of-the-art results on the GazeFollow, VideoAttentionTarget and ChildPlay datasets.
Score: 14.82916312780764
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Gaze communication plays a crucial role in daily social interactions. Quantifying this behavior can help in human-computer interaction and digital phenotyping. While end-to-end models exist for gaze target detection, they only utilize a single decoder to simultaneously localize human heads and predict their corresponding gaze (e.g., 2D points or heatmap) in a scene. This multitask learning approach generates a unified and entangled representation for human head localization and gaze location prediction. Herein, we propose GazeDETR, a novel end-to-end architecture with two disentangled decoders that individually learn unique representations and effectively utilize coherent attentive fields for each subtask. More specifically, we demonstrate that its human head predictor utilizes local information, while its gaze decoder incorporates both local and global information. Our proposed architecture achieves state-of-the-art results on the GazeFollow, VideoAttentionTarget and ChildPlay datasets. It outperforms existing end-to-end models with a notable margin.

Related papers

Gaze-guided Hand-Object Interaction Synthesis: Dataset and Method [61.19028558470065]
We present GazeHOI, the first dataset to capture simultaneous 3D modeling of gaze, hand, and object interactions. To tackle these issues, we propose a stacked gaze-guided hand-object interaction diffusion model, named GHO-Diffusion. We also introduce HOI-Manifold Guidance during the sampling stage of GHO-Diffusion, enabling fine-grained control over generated motions.
arXiv Detail & Related papers (2024-03-24T14:24:13Z)
Voila-A: Aligning Vision-Language Models with User's Gaze Attention [56.755993500556734]
We introduce gaze information as a proxy for human attention to guide Vision-Language Models (VLMs). We propose a novel approach, Voila-A, for gaze alignment to enhance the interpretability and effectiveness of these models in real-world applications.
arXiv Detail & Related papers (2023-12-22T17:34:01Z)
Pose2Gaze: Eye-body Coordination during Daily Activities for Gaze Prediction from Full-body Poses [11.545286742778977]
We first report a comprehensive analysis of eye-body coordination in various human-object and human-human interaction activities. We then present Pose2Gaze, an eye-body coordination model that uses a convolutional neural network to extract features from head direction and full-body poses.
arXiv Detail & Related papers (2023-12-19T10:55:46Z)
Sharingan: A Transformer-based Architecture for Gaze Following [14.594691605523005]
We introduce a novel transformer-based architecture for 2D gaze prediction. This paper achieves state-of-the-art results on the GazeFollow and VideoAttentionTarget datasets.
arXiv Detail & Related papers (2023-10-01T23:14:54Z)
RAZE: Region Guided Self-Supervised Gaze Representation Learning [5.919214040221055]
RAZE is a Region guided self-supervised gAZE representation learning framework that leverages non-annotated facial image data. Ize-Net is a capsule-layer-based CNN architecture that can efficiently capture rich eye representations.
arXiv Detail & Related papers (2022-08-04T06:23:49Z)
GIMO: Gaze-Informed Human Motion Prediction in Context [75.52839760700833]
We propose a large-scale human motion dataset that delivers high-quality body pose sequences, scene scans, and ego-centric views with eye gaze. Our data collection is not tied to specific scenes, which further boosts the motion dynamics observed from our subjects. To realize the full potential of gaze, we propose a novel network architecture that enables bidirectional communication between the gaze and motion branches.
arXiv Detail & Related papers (2022-04-20T13:17:39Z)
End-to-End Human-Gaze-Target Detection with Transformers [57.00864538284686]
We propose an effective and efficient method for Human-Gaze-Target (HGT) detection, i.e., gaze following. Our method, named Human-Gaze-Target detection TRansformer or HGTTR, streamlines the HGT detection pipeline by eliminating all other components. The effectiveness and robustness of our proposed method are verified with extensive experiments on the two standard benchmark datasets, GazeFollowing and VideoAttentionTarget.
arXiv Detail & Related papers (2022-03-20T02:37:06Z)
L2CS-Net: Fine-Grained Gaze Estimation in Unconstrained Environments [2.5234156040689237]
We propose a robust CNN-based model for predicting gaze in unconstrained settings. We use two identical losses, one for each angle, to improve network learning and increase its generalization. Our proposed model achieves state-of-the-art accuracy of 3.92° and 10.41° on MPIIGaze and Gaze360 datasets, respectively.
arXiv Detail & Related papers (2022-03-07T12:35:39Z)
TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild [77.59069361196404]
TRiPOD is a novel method for predicting body dynamics based on graph attentional networks. To incorporate a real-world challenge, we learn an indicator representing whether an estimated body joint is visible/invisible at each frame. Our evaluation shows that TRiPOD outperforms all prior work and state-of-the-art specifically designed for each of the trajectory and pose forecasting tasks.
arXiv Detail & Related papers (2021-04-08T20:01:00Z)
DRG: Dual Relation Graph for Human-Object Interaction Detection [65.50707710054141]
We tackle the challenging problem of human-object interaction (HOI) detection. Existing methods either recognize the interaction of each human-object pair in isolation or perform joint inference based on complex appearance-based features. In this paper, we leverage an abstract spatial-semantic representation to describe each human-object pair and aggregate the contextual information of the scene via a dual relation graph.
arXiv Detail & Related papers (2020-08-26T17:59:40Z)

This list is automatically generated from the titles and abstracts of the papers on this site.