Spatio-Temporal Attention and Gaussian Processes for Personalized Video Gaze Estimation
- URL: http://arxiv.org/abs/2404.05215v2
- Date: Wed, 10 Apr 2024 00:49:11 GMT
- Title: Spatio-Temporal Attention and Gaussian Processes for Personalized Video Gaze Estimation
- Authors: Swati Jindal, Mohit Yadav, Roberto Manduchi
- Abstract summary: We propose a simple and novel deep learning model designed to estimate gaze from videos.
Our method employs a spatial attention mechanism that tracks spatial dynamics within videos.
Experimental results confirm the efficacy of the proposed approach, demonstrating its success in both within-dataset and cross-dataset settings.
- Score: 7.545077734926115
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Gaze is an essential prompt for analyzing human behavior and attention. Recently, there has been an increasing interest in determining gaze direction from facial videos. However, video gaze estimation faces significant challenges, such as understanding the dynamic evolution of gaze in video sequences, dealing with static backgrounds, and adapting to variations in illumination. To address these challenges, we propose a simple and novel deep learning model designed to estimate gaze from videos, incorporating a specialized attention module. Our method employs a spatial attention mechanism that tracks spatial dynamics within videos. This technique enables accurate gaze direction prediction through a temporal sequence model, adeptly transforming spatial observations into temporal insights, thereby significantly improving gaze estimation accuracy. Additionally, our approach integrates Gaussian processes to include individual-specific traits, facilitating the personalization of our model with just a few labeled samples. Experimental results confirm the efficacy of the proposed approach, demonstrating its success in both within-dataset and cross-dataset settings. Specifically, our proposed approach achieves state-of-the-art performance on the Gaze360 dataset, improving by 2.5° without personalization. Further, by personalizing the model with just three samples, we achieved an additional improvement of 0.8°. The code and pre-trained models are available at https://github.com/jswati31/stage.
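The abstract describes a three-stage pipeline: per-frame spatial attention over backbone features, a temporal sequence model over the attended features, and Gaussian-process personalization from a few labeled samples. The sketch below is a minimal illustration of that pipeline, not the authors' implementation (see the linked repository): the module names, feature dimensions, the GRU as the temporal model, and the use of scikit-learn's GaussianProcessRegressor on prediction residuals are all assumptions made here for clarity.

```python
# Illustrative sketch only; names, dimensions, and the GRU/GP choices are
# assumptions, not the STAGE implementation (https://github.com/jswati31/stage).
import torch
import torch.nn as nn
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel


class SpatialAttention(nn.Module):
    """Scores feature-map locations and pools each frame to a single vector."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, fmap):                                      # (B*T, C, H, W)
        attn = torch.softmax(self.score(fmap).flatten(2), dim=-1)  # (B*T, 1, H*W)
        return (fmap.flatten(2) * attn).sum(-1)                     # (B*T, C)


class VideoGazeNet(nn.Module):
    """Backbone -> spatial attention -> temporal model -> gaze (yaw, pitch)."""
    def __init__(self, backbone, channels=512, hidden=256):
        super().__init__()
        self.backbone = backbone                 # any CNN returning (N, C, h, w)
        self.attention = SpatialAttention(channels)
        self.temporal = nn.GRU(channels, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)

    def forward(self, frames):                                    # (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1))               # (B*T, C, h, w)
        feats = self.attention(feats).view(b, t, -1)              # (B, T, C)
        seq, _ = self.temporal(feats)                             # (B, T, hidden)
        return self.head(seq)                                     # (B, T, 2) angles


def personalize(features, labels, base_preds):
    """Fit a GP to the residuals of a handful (e.g. three) of labeled frames;
    at test time, add gp.predict(new_features) to the base model's output."""
    gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
    gp.fit(features, labels - base_preds)        # features: (k, d), labels: (k, 2)
    return gp
```

One design point worth noting: fitting the GP on residuals keeps the few-shot personalization step cheap and leaves the deep model untouched, which is consistent with the abstract's claim of adapting with only three labeled samples; whether the authors apply the GP to residuals or directly to features is not stated in this summary.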
Related papers
- TPP-Gaze: Modelling Gaze Dynamics in Space and Time with Neural Temporal Point Processes [63.95928298690001]
We present TPP-Gaze, a novel and principled approach to model scanpath dynamics based on Neural Temporal Point Process (TPP).
Our results show the overall superior performance of the proposed model compared to state-of-the-art approaches.
arXiv Detail & Related papers (2024-10-30T19:22:38Z) - MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion [118.74385965694694]
We present Motion DUSt3R (MonST3R), a novel geometry-first approach that directly estimates per-timestep geometry from dynamic scenes.
By simply estimating a pointmap for each timestep, we can effectively adapt DUSt3R's representation, previously only used for static scenes, to dynamic scenes.
We show that by posing the problem as a fine-tuning task, identifying several suitable datasets, and strategically training the model on this limited data, we can surprisingly enable the model to handle dynamics.
arXiv Detail & Related papers (2024-10-04T18:00:07Z) - Localized Gaussians as Self-Attention Weights for Point Clouds Correspondence [92.07601770031236]
We investigate semantically meaningful patterns in the attention heads of an encoder-only Transformer architecture.
We find that fixing the attention weights not only accelerates the training process but also enhances the stability of the optimization (see the sketch after this list).
arXiv Detail & Related papers (2024-09-20T07:41:47Z) - Diffusion-Refined VQA Annotations for Semi-Supervised Gaze Following [74.30960564603917]
Training gaze following models requires a large number of images with gaze target coordinates annotated by human annotators.
We propose the first semi-supervised method for gaze following by introducing two novel priors to the task.
Our method outperforms simple pseudo-annotation generation baselines on the GazeFollow image dataset.
arXiv Detail & Related papers (2024-06-04T20:43:26Z) - GazeFusion: Saliency-guided Image Generation [50.37783903347613]
Diffusion models offer unprecedented image generation capabilities given just a text prompt.
We present a saliency-guided framework to incorporate the data priors of human visual attention into the generation process.
arXiv Detail & Related papers (2024-03-16T21:01:35Z) - DiffPose: SpatioTemporal Diffusion Model for Video-Based Human Pose Estimation [16.32910684198013]
We present DiffPose, a novel diffusion architecture that formulates video-based human pose estimation as a conditional heatmap generation problem.
We show two unique characteristics from DiffPose on pose estimation task: (i) the ability to combine multiple sets of pose estimates to improve prediction accuracy, particularly for challenging joints, and (ii) the ability to adjust the number of iterative steps for feature refinement without retraining the model.
arXiv Detail & Related papers (2023-07-31T14:00:23Z) - TempSAL -- Uncovering Temporal Information for Deep Saliency Prediction [64.63645677568384]
We introduce a novel saliency prediction model that learns to output saliency maps in sequential time intervals.
Our approach locally modulates the saliency predictions by combining the learned temporal maps.
Our code will be publicly available on GitHub.
arXiv Detail & Related papers (2023-01-05T22:10:16Z) - NeRF-Gaze: A Head-Eye Redirection Parametric Model for Gaze Estimation [37.977032771941715]
We propose a novel Head-Eye redirection parametric model based on Neural Radiance Field.
Our model can decouple the face and eyes for separate neural rendering.
This allows face attributes, identity, illumination, and eye gaze direction to be controlled separately.
arXiv Detail & Related papers (2022-12-30T13:52:28Z) - Improving saliency models' predictions of the next fixation with humans' intrinsic cost of gaze shifts [6.315366433343492]
We develop a principled framework for predicting the next gaze target and the empirical measurement of the human cost for gaze.
We provide an implementation of human gaze preferences, which can be used to improve arbitrary saliency models' predictions of humans' next gaze targets.
arXiv Detail & Related papers (2022-07-09T11:21:13Z) - Learning-by-Novel-View-Synthesis for Full-Face Appearance-based 3D Gaze Estimation [8.929311633814411]
This work examines a novel approach for synthesizing gaze estimation training data based on monocular 3D face reconstruction.
Unlike prior works using multi-view reconstruction, photo-realistic CG models, or generative neural networks, our approach can manipulate and extend the head pose range of existing training data.
arXiv Detail & Related papers (2022-01-20T00:29:45Z)
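As referenced in the "Localized Gaussians as Self-Attention Weights for Point Clouds Correspondence" entry above, fixed attention weights can replace learned query-key scores. The fragment below is only a generic illustration of that idea, under the assumption that the weights are a Gaussian (RBF) of pairwise point distances; the gaussian_attention name and the sigma parameter are hypothetical and not taken from that paper.

```python
import torch


def gaussian_attention(points, values, sigma=0.1):
    """Self-attention with fixed, localized-Gaussian weights: each point
    attends to its neighbours with weight exp(-d^2 / (2 * sigma^2)),
    normalized via softmax, instead of learned query-key scores.
    points: (N, 3) coordinates, values: (N, D) per-point features."""
    dist2 = torch.cdist(points, points).pow(2)            # (N, N) squared distances
    weights = torch.softmax(-dist2 / (2 * sigma ** 2), dim=-1)
    return weights @ values                                # (N, D) attended features
```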
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.