Gazeformer: Scalable, Effective and Fast Prediction of Goal-Directed
Human Attention
- URL: http://arxiv.org/abs/2303.15274v3
- Date: Sun, 2 Jul 2023 22:52:37 GMT
- Title: Gazeformer: Scalable, Effective and Fast Prediction of Goal-Directed
Human Attention
- Authors: Sounak Mondal, Zhibo Yang, Seoyoung Ahn, Dimitris Samaras, Gregory
Zelinsky, Minh Hoai
- Abstract summary: We pose ZeroGaze, a new zero-shot learning task in which gaze is predicted for never-before-searched objects, and develop a novel model to solve it.
Gazeformer encodes the target using a natural language model, thus leveraging semantic similarities in scanpath prediction.
It also outperforms existing target-detection models on standard gaze prediction for both target-present and target-absent search tasks.
- Score: 44.10971508325032
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Predicting human gaze is important in Human-Computer Interaction (HCI).
However, to practically serve HCI applications, gaze prediction models must be
scalable, fast, and accurate in their spatial and temporal gaze predictions.
Recent scanpath prediction models focus on goal-directed attention (search).
Such models are limited in their application due to a common approach relying
on trained target detectors for all possible objects, and the availability of
human gaze data for their training (both not scalable). In response, we pose a
new task called ZeroGaze, a new variant of zero-shot learning where gaze is
predicted for never-before-searched objects, and we develop a novel model,
Gazeformer, to solve the ZeroGaze problem. In contrast to existing methods
using object detector modules, Gazeformer encodes the target using a natural
language model, thus leveraging semantic similarities in scanpath prediction.
We use a transformer-based encoder-decoder architecture because transformers
are particularly useful for generating contextual representations. Gazeformer
surpasses other models by a large margin on the ZeroGaze setting. It also
outperforms existing target-detection models on standard gaze prediction for
both target-present and target-absent search tasks. In addition to its improved
performance, Gazeformer is more than five times faster than the
state-of-the-art target-present visual search model.
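The abstract highlights two design choices: the search target is encoded with a natural language model rather than a per-category detector, and a transformer encoder-decoder generates the scanpath. The sketch below illustrates how such a model could be wired up in PyTorch. It is not the authors' implementation; the feature dimensions, the single-vector target embedding (e.g., from a frozen RoBERTa-style encoder), and the (x, y, duration) fixation parameterization are assumptions made for illustration.

```python
# Minimal sketch (not the authors' code) of a Gazeformer-style scanpath predictor:
# the search target is a language-model embedding, the image is a set of patch
# features, and a transformer encoder-decoder regresses a fixation sequence.
import torch
import torch.nn as nn

class GazeformerSketch(nn.Module):
    def __init__(self, d_model=256, lang_dim=768, vis_dim=2048, max_len=7):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)    # project image patch features
        self.lang_proj = nn.Linear(lang_dim, d_model)  # project target-name embedding
        self.queries = nn.Parameter(torch.randn(max_len, d_model))  # one query per fixation slot
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8, num_encoder_layers=3,
            num_decoder_layers=3, batch_first=True)
        self.fixation_head = nn.Linear(d_model, 3)     # (x, y, duration) per fixation

    def forward(self, patch_feats, target_emb):
        """
        patch_feats: (B, N, vis_dim)  visual features for N image patches
        target_emb:  (B, lang_dim)    embedding of the target name from a frozen
                                      language model (RoBERTa-style, an assumption)
        """
        vis = self.vis_proj(patch_feats)                      # (B, N, d_model)
        tgt = self.lang_proj(target_emb).unsqueeze(1)         # (B, 1, d_model)
        memory_in = torch.cat([tgt, vis], dim=1)              # fuse target token with image tokens
        queries = self.queries.unsqueeze(0).expand(vis.size(0), -1, -1)
        decoded = self.transformer(memory_in, queries)        # (B, max_len, d_model)
        return self.fixation_head(decoded)                    # predicted scanpath

# Usage with fabricated shapes, for illustration only.
model = GazeformerSketch()
scanpath = model(torch.randn(2, 49, 2048), torch.randn(2, 768))
print(scanpath.shape)  # torch.Size([2, 7, 3])
```

Because the target enters only as a language embedding, a never-before-searched object name can be substituted without retraining a detector, which is the property the ZeroGaze setting is designed to test.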
Related papers
- Modeling Human Gaze Behavior with Diffusion Models for Unified Scanpath Prediction [66.71402249062777]
We present ScanDiff, a novel architecture that combines diffusion models with Vision Transformers to generate diverse and realistic scanpaths.
Our method explicitly models scanpath variability by leveraging the nature of diffusion models, producing a wide range of plausible gaze trajectories.
Experiments on benchmark datasets show that ScanDiff surpasses state-of-the-art methods in both free-viewing and task-driven scenarios.
arXiv Detail & Related papers (2025-07-30T18:36:09Z)
- Human Scanpath Prediction in Target-Present Visual Search with Semantic-Foveal Bayesian Attention [49.99728312519117]
SemBA-FAST is a top-down framework designed for predicting human visual attention in target-present visual search.
We evaluate SemBA-FAST on the COCO-Search18 benchmark dataset, comparing its performance against other scanpath prediction models.
These findings provide valuable insights into the capabilities of semantic-foveal probabilistic frameworks for human-like attention modelling.
arXiv Detail & Related papers (2025-07-24T15:19:23Z)
- GazeTarget360: Towards Gaze Target Estimation in 360-Degree for Robot Perception [3.312411881096304]
We propose a system to address the problem of 360-degree gaze target estimation from an image.
The system, named GazeTarget360, integrates conditional inference engines of an eye-contact detector, a pre-trained vision encoder, and a multi-scale-fusion decoder.
Cross validation results show that GazeTarget360 can produce accurate and reliable gaze target predictions in unseen scenarios.
arXiv Detail & Related papers (2025-06-30T20:44:40Z)
- Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders [33.26237143983192]
We address the problem of gaze target estimation, which aims to predict where a person is looking in a scene.
We propose Gaze-LLE, a novel transformer framework that streamlines gaze target estimation by leveraging features from a frozen DINOv2 encoder.
arXiv Detail & Related papers (2024-12-12T18:55:30Z)
- Stanceformer: Target-Aware Transformer for Stance Detection [59.69858080492586]
Stance Detection involves discerning the stance expressed in a text towards a specific subject or target.
Prior works have relied on existing transformer models that lack the capability to prioritize targets effectively.
We introduce Stanceformer, a target-aware transformer model that incorporates enhanced attention towards the targets during both training and inference.
arXiv Detail & Related papers (2024-10-09T17:24:28Z)
- CrossGaze: A Strong Method for 3D Gaze Estimation in the Wild [4.089889918897877]
We propose CrossGaze, a strong baseline for gaze estimation.
Our model surpasses several state-of-the-art methods, achieving a mean angular error of 9.94 degrees.
Our proposed model serves as a strong foundation for future research and development in gaze estimation.
arXiv Detail & Related papers (2024-02-13T09:20:26Z)
- Human motion trajectory prediction using the Social Force Model for real-time and low computational cost applications [3.5970055082749655]
We propose a novel trajectory prediction model, the Social Force Generative Adversarial Network (SoFGAN).
SoFGAN uses a Generative Adversarial Network (GAN) and a Social Force Model (SFM) to generate diverse plausible people trajectories while reducing collisions in a scene.
We show that our method is more accurate on the UCY and BIWI datasets than most current state-of-the-art models, and also reduces collisions compared to other approaches.
arXiv Detail & Related papers (2023-11-17T15:32:21Z)
- Unifying Top-down and Bottom-up Scanpath Prediction Using Transformers [40.27531644565077]
We propose the Human Attention Transformer (HAT), a single model that predicts both forms of attention control.
HAT sets a new standard in computational attention, which emphasizes effectiveness, generality, and interpretability.
arXiv Detail & Related papers (2023-03-16T15:13:09Z)
- TempSAL -- Uncovering Temporal Information for Deep Saliency Prediction [64.63645677568384]
We introduce a novel saliency prediction model that learns to output saliency maps in sequential time intervals.
Our approach locally modulates the saliency predictions by combining the learned temporal maps.
Our code will be publicly available on GitHub.
arXiv Detail & Related papers (2023-01-05T22:10:16Z)
- 3DGazeNet: Generalizing Gaze Estimation with Weak-Supervision from Synthetic Views [67.00931529296788]
We propose to train general gaze estimation models which can be directly employed in novel environments without adaptation.
We create a large-scale dataset of diverse faces with gaze pseudo-annotations, which we extract based on the 3D geometry of the scene.
We test our method on the task of gaze generalization, where it improves on the state of the art by up to 30% when no ground-truth data are available.
arXiv Detail & Related papers (2022-12-06T14:15:17Z)
- End-to-End Human-Gaze-Target Detection with Transformers [57.00864538284686]
We propose an effective and efficient method for Human-Gaze-Target (HGT) detection, i.e., gaze following.
Our method, named Human-Gaze-Target detection TRansformer or HGTTR, streamlines the HGT detection pipeline by eliminating all other components.
The effectiveness and robustness of our proposed method are verified with extensive experiments on the two standard benchmark datasets, GazeFollowing and VideoAttentionTarget.
arXiv Detail & Related papers (2022-03-20T02:37:06Z)
- L2CS-Net: Fine-Grained Gaze Estimation in Unconstrained Environments [2.5234156040689237]
We propose a robust CNN-based model for predicting gaze in unconstrained settings.
We use two identical losses, one for each angle, to improve network learning and increase its generalization.
Our proposed model achieves state-of-the-art accuracy of 3.92° and 10.41° on the MPIIGaze and Gaze360 datasets, respectively.
arXiv Detail & Related papers (2022-03-07T12:35:39Z)
- STAR: Sparse Transformer-based Action Recognition [61.490243467748314]
This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of data.
Experiments show that our model achieves comparable performance while using far fewer trainable parameters and attaining high speed in both training and inference.
arXiv Detail & Related papers (2021-07-15T02:53:11Z)