Gazeformer: Scalable, Effective and Fast Prediction of Goal-Directed
Human Attention
- URL: http://arxiv.org/abs/2303.15274v3
- Date: Sun, 2 Jul 2023 22:52:37 GMT
- Title: Gazeformer: Scalable, Effective and Fast Prediction of Goal-Directed
Human Attention
- Authors: Sounak Mondal, Zhibo Yang, Seoyoung Ahn, Dimitris Samaras, Gregory
Zelinsky, Minh Hoai
- Abstract summary: We pose ZeroGaze, a new zero-shot learning task in which gaze is predicted for never-before-searched objects, and develop a novel model to solve it.
Gazeformer encodes the target using a natural language model, thus leveraging semantic similarities in scanpath prediction.
It also outperforms existing target-detection models on standard gaze prediction for both target-present and target-absent search tasks.
- Score: 44.10971508325032
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Predicting human gaze is important in Human-Computer Interaction (HCI).
However, to practically serve HCI applications, gaze prediction models must be
scalable, fast, and accurate in their spatial and temporal gaze predictions.
Recent scanpath prediction models focus on goal-directed attention (search).
Such models are limited in their application due to a common approach relying
on trained target detectors for all possible objects, and the availability of
human gaze data for their training (both not scalable). In response, we pose a
new task called ZeroGaze, a new variant of zero-shot learning where gaze is
predicted for never-before-searched objects, and we develop a novel model,
Gazeformer, to solve the ZeroGaze problem. In contrast to existing methods
using object detector modules, Gazeformer encodes the target using a natural
language model, thus leveraging semantic similarities in scanpath prediction.
We use a transformer-based encoder-decoder architecture because transformers
are particularly useful for generating contextual representations. Gazeformer
surpasses other models by a large margin on the ZeroGaze setting. It also
outperforms existing target-detection models on standard gaze prediction for
both target-present and target-absent search tasks. In addition to its improved
performance, Gazeformer is more than five times faster than the
state-of-the-art target-present visual search model.
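The abstract highlights two design choices: the search target is encoded with a natural language model rather than a per-category detector, and a transformer encoder-decoder generates the scanpath. The sketch below illustrates how such a model could be wired up in PyTorch. It is not the authors' implementation; the feature dimensions, the single-vector target embedding (e.g., from a frozen RoBERTa-style encoder), and the (x, y, duration) fixation parameterization are assumptions made for illustration.

```python
# Minimal sketch (not the authors' code) of a Gazeformer-style scanpath predictor:
# the search target is a language-model embedding, the image is a set of patch
# features, and a transformer encoder-decoder regresses a fixation sequence.
import torch
import torch.nn as nn

class GazeformerSketch(nn.Module):
    def __init__(self, d_model=256, lang_dim=768, vis_dim=2048, max_len=7):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)    # project image patch features
        self.lang_proj = nn.Linear(lang_dim, d_model)  # project target-name embedding
        self.queries = nn.Parameter(torch.randn(max_len, d_model))  # one query per fixation slot
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8, num_encoder_layers=3,
            num_decoder_layers=3, batch_first=True)
        self.fixation_head = nn.Linear(d_model, 3)     # (x, y, duration) per fixation

    def forward(self, patch_feats, target_emb):
        """
        patch_feats: (B, N, vis_dim)  visual features for N image patches
        target_emb:  (B, lang_dim)    embedding of the target name from a frozen
                                      language model (RoBERTa-style, an assumption)
        """
        vis = self.vis_proj(patch_feats)                      # (B, N, d_model)
        tgt = self.lang_proj(target_emb).unsqueeze(1)         # (B, 1, d_model)
        memory_in = torch.cat([tgt, vis], dim=1)              # fuse target token with image tokens
        queries = self.queries.unsqueeze(0).expand(vis.size(0), -1, -1)
        decoded = self.transformer(memory_in, queries)        # (B, max_len, d_model)
        return self.fixation_head(decoded)                    # predicted scanpath

# Usage with fabricated shapes, for illustration only.
model = GazeformerSketch()
scanpath = model(torch.randn(2, 49, 2048), torch.randn(2, 768))
print(scanpath.shape)  # torch.Size([2, 7, 3])
```

Because the target enters only as a language embedding, a never-before-searched object name can be substituted without retraining a detector, which is the property the ZeroGaze setting is designed to test.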
Related papers
- Modeling Human Gaze Behavior with Diffusion Models for Unified Scanpath Prediction [66.71402249062777]
We present ScanDiff, a novel architecture that combines diffusion models with Vision Transformers to generate diverse and realistic scanpaths.
Our method explicitly models scanpath variability by leveraging the nature of diffusion models, producing a wide range of plausible gaze trajectories.
Experiments on benchmark datasets show that ScanDiff surpasses state-of-the-art methods in both free-viewing and task-driven scenarios.
arXiv Detail & Related papers (2025-07-30T18:36:09Z)
- Human Scanpath Prediction in Target-Present Visual Search with Semantic-Foveal Bayesian Attention [49.99728312519117]
SemBA-FAST is a top-down framework designed for predicting human visual attention in target-present visual search.
We evaluate SemBA-FAST on the COCO-Search18 benchmark dataset, comparing its performance against other scanpath prediction models.
These findings provide valuable insights into the capabilities of semantic-foveal probabilistic frameworks for human-like attention modelling.
arXiv Detail & Related papers (2025-07-24T15:19:23Z)
- GazeTarget360: Towards Gaze Target Estimation in 360-Degree for Robot Perception [3.312411881096304]
We propose a system to address the problem of 360-degree gaze target estimation from an image.
The system, named GazeTarget360, integrates conditional inference engines of an eye-contact detector, a pre-trained vision encoder, and a multi-scale-fusion decoder.
Cross validation results show that GazeTarget360 can produce accurate and reliable gaze target predictions in unseen scenarios.
arXiv Detail & Related papers (2025-06-30T20:44:40Z)
- Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders [33.26237143983192]
We address the problem of gaze target estimation, which aims to predict where a person is looking in a scene.
We propose Gaze-LLE, a novel transformer framework that streamlines gaze target estimation by leveraging features from a frozen DINOv2 encoder.
arXiv Detail & Related papers (2024-12-12T18:55:30Z)
- Stanceformer: Target-Aware Transformer for Stance Detection [59.69858080492586]
Stance Detection involves discerning the stance expressed in a text towards a specific subject or target.
Prior works have relied on existing transformer models that lack the capability to prioritize targets effectively.
We introduce Stanceformer, a target-aware transformer model that incorporates enhanced attention towards the targets during both training and inference.
arXiv Detail & Related papers (2024-10-09T17:24:28Z)
- CrossGaze: A Strong Method for 3D Gaze Estimation in the Wild [4.089889918897877]
We propose CrossGaze, a strong baseline for gaze estimation.
Our model surpasses several state-of-the-art methods, achieving a mean angular error of 9.94 degrees.
Our proposed model serves as a strong foundation for future research and development in gaze estimation.
arXiv Detail & Related papers (2024-02-13T09:20:26Z)
- Human motion trajectory prediction using the Social Force Model for real-time and low computational cost applications [3.5970055082749655]
We propose a novel trajectory prediction model, the Social Force Generative Adversarial Network (SoFGAN).
SoFGAN uses a Generative Adversarial Network (GAN) and a Social Force Model (SFM) to generate diverse plausible people trajectories while reducing collisions in a scene.
We show that our method is more accurate on the UCY and BIWI datasets than most current state-of-the-art models, and also reduces collisions compared to other approaches.
arXiv Detail & Related papers (2023-11-17T15:32:21Z)
- Unifying Top-down and Bottom-up Scanpath Prediction Using Transformers [40.27531644565077]
We propose the Human Attention Transformer (HAT), a single model that predicts both forms of attention control.
HAT sets a new standard in computational attention, which emphasizes effectiveness, generality, and interpretability.
arXiv Detail & Related papers (2023-03-16T15:13:09Z)
- TempSAL -- Uncovering Temporal Information for Deep Saliency Prediction [64.63645677568384]
We introduce a novel saliency prediction model that learns to output saliency maps in sequential time intervals.
Our approach locally modulates the saliency predictions by combining the learned temporal maps.
Our code will be publicly available on GitHub.
arXiv Detail & Related papers (2023-01-05T22:10:16Z)
- 3DGazeNet: Generalizing Gaze Estimation with Weak-Supervision from Synthetic Views [67.00931529296788]
We propose to train general gaze estimation models which can be directly employed in novel environments without adaptation.
We create a large-scale dataset of diverse faces with gaze pseudo-annotations, which we extract based on the 3D geometry of the scene.
We test our method on the task of gaze generalization, where it improves on the state of the art by up to 30% when no ground-truth data are available.
arXiv Detail & Related papers (2022-12-06T14:15:17Z)
- End-to-End Human-Gaze-Target Detection with Transformers [57.00864538284686]
We propose an effective and efficient method for Human-Gaze-Target (HGT) detection, i.e., gaze following.
Our method, named Human-Gaze-Target detection TRansformer or HGTTR, streamlines the HGT detection pipeline by eliminating all other components.
The effectiveness and robustness of our proposed method are verified with extensive experiments on the two standard benchmark datasets, GazeFollowing and VideoAttentionTarget.
arXiv Detail & Related papers (2022-03-20T02:37:06Z)
- L2CS-Net: Fine-Grained Gaze Estimation in Unconstrained Environments [2.5234156040689237]
We propose a robust CNN-based model for predicting gaze in unconstrained settings.
We use two identical losses, one for each angle, to improve network learning and increase its generalization.
Our proposed model achieves state-of-the-art accuracy of 3.92° and 10.41° on the MPIIGaze and Gaze360 datasets, respectively.
arXiv Detail & Related papers (2022-03-07T12:35:39Z)
- STAR: Sparse Transformer-based Action Recognition [61.490243467748314]
This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of data.
Experiments show that our model achieves comparable performance while using far fewer trainable parameters and attaining high speed in both training and inference.
arXiv Detail & Related papers (2021-07-15T02:53:11Z)