DiffGaze: A Diffusion Model for Continuous Gaze Sequence Generation on 360° Images
- URL: http://arxiv.org/abs/2403.17477v1
- Date: Tue, 26 Mar 2024 08:13:02 GMT
- Title: DiffGaze: A Diffusion Model for Continuous Gaze Sequence Generation on 360° Images
- Authors: Chuhan Jiao, Yao Wang, Guanhua Zhang, Mihai Bâce, Zhiming Hu, Andreas Bulling
- Abstract summary: We present DiffGaze, a novel method for generating realistic and diverse continuous human gaze sequences on 360° images.
Our evaluations show that DiffGaze outperforms state-of-the-art methods on all tasks.
- Score: 17.714378486267055
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present DiffGaze, a novel method for generating realistic and diverse continuous human gaze sequences on 360° images based on a conditional score-based denoising diffusion model. Generating human gaze on 360° images is important for various human-computer interaction and computer graphics applications, e.g. for creating large-scale eye tracking datasets or for realistic animation of virtual humans. However, existing methods are limited to predicting discrete fixation sequences or aggregated saliency maps, thereby neglecting crucial parts of natural gaze behaviour. Our method uses features extracted from 360° images as the condition and uses two transformers to model the temporal and spatial dependencies of continuous human gaze. We evaluate DiffGaze on two 360° image benchmarks for gaze sequence generation as well as scanpath prediction and saliency prediction. Our evaluations show that DiffGaze outperforms state-of-the-art methods on all tasks on both benchmarks. We also report a 21-participant user study showing that our method generates gaze sequences that are indistinguishable from real human sequences.
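The core mechanism behind DiffGaze is conditional reverse-diffusion sampling of a continuous gaze trajectory. The following is a minimal, runnable sketch of that sampling loop only, not the authors' architecture: the learned transformer denoiser is replaced by an oracle noise predictor built from a hypothetical condition-derived target trajectory, and all names are illustrative.

```python
import numpy as np

def conditional_ddpm_sample(target, T=100, seed=0):
    """Ancestral DDPM sampling of a gaze sequence (time_steps x 2 coords).

    `target` stands in for the condition-derived trajectory; in DiffGaze a
    transformer denoiser predicts the noise from 360-degree image features
    instead of this oracle.
    """
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, T)            # linear noise schedule
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)

    x = rng.standard_normal(target.shape)         # start from pure noise
    for t in range(T - 1, -1, -1):
        # oracle noise prediction: the exact eps if the clean sample were `target`
        eps = (x - np.sqrt(alpha_bar[t]) * target) / np.sqrt(1.0 - alpha_bar[t])
        # standard DDPM posterior mean
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise      # ancestral step
    return x

# a hypothetical continuous gaze path on an equirectangular panorama,
# parameterised as normalised (longitude, latitude) over 50 time steps
ts = np.linspace(0.0, 1.0, 50)
gaze = np.stack([0.5 + 0.3 * np.sin(2 * np.pi * ts),
                 0.5 * np.ones_like(ts)], axis=1)
sample = conditional_ddpm_sample(gaze)
```

With an oracle noise predictor the sampler recovers the conditioning trajectory; in the real model, the quality of the transformer denoiser determines how realistic and diverse the generated sequences are.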
Related papers
- Evaluating Multiview Object Consistency in Humans and Image Models [68.36073530804296]
We leverage an experimental design from the cognitive sciences which requires zero-shot visual inferences about object shape.
We collect 35K trials of behavioral data from over 500 participants.
We then evaluate the performance of common vision models.
arXiv Detail & Related papers (2024-09-09T17:59:13Z)
- Look Hear: Gaze Prediction for Speech-directed Human Attention [49.81718760025951]
Our study focuses on the incremental prediction of attention as a person is seeing an image and hearing a referring expression.
We developed the Attention in Referral Transformer model or ART, which predicts the human fixations spurred by each word in a referring expression.
In our quantitative and qualitative analyses, ART not only outperforms existing methods in scanpath prediction, but also appears to capture several human attention patterns.
arXiv Detail & Related papers (2024-07-28T22:35:08Z)
- Spatio-Temporal Attention and Gaussian Processes for Personalized Video Gaze Estimation [7.545077734926115]
We propose a simple and novel deep learning model designed to estimate gaze from videos.
Our method employs a spatial attention mechanism that tracks spatial dynamics within videos.
Experimental results confirm the efficacy of the proposed approach, demonstrating its success in both within-dataset and cross-dataset settings.
arXiv Detail & Related papers (2024-04-08T06:07:32Z)
- GazeMoDiff: Gaze-guided Diffusion Model for Stochastic Human Motion Prediction [10.982807572404166]
We present GazeMo - a novel gaze-guided denoising diffusion model to generate human motions.
Our method first extracts gaze and motion features with dedicated encoders, then employs a graph attention network to fuse these features.
Our method outperforms the state-of-the-art methods by a large margin in terms of multi-modal final error.
arXiv Detail & Related papers (2023-12-19T12:10:12Z)
- GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians [51.46168990249278]
We present an efficient approach to creating realistic human avatars with dynamic 3D appearances from a single video.
GaussianAvatar is validated on both the public dataset and our collected dataset.
arXiv Detail & Related papers (2023-12-04T18:55:45Z)
- MRGAN360: Multi-stage Recurrent Generative Adversarial Network for 360 Degree Image Saliency Prediction [10.541086214760497]
We propose a novel multi-stage recurrent generative adversarial network for omnidirectional images (ODIs), dubbed MRGAN360.
At each stage, the prediction model takes as input the original image and the output of the previous stage and outputs a more accurate saliency map.
We employ a recurrent neural network among adjacent prediction stages to model their correlations, and exploit a discriminator at the end of each stage to supervise the output saliency map.
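The stage-wise refinement described above can be sketched as follows. This is an untrained, illustrative stand-in: `extract_evidence` replaces MRGAN360's learned generator, a fixed blend replaces the adversarially supervised update, and all names are hypothetical; only the structure (each stage consumes the original image plus the previous stage's map, with recurrent state shared across stages) mirrors the description.

```python
import numpy as np

def extract_evidence(image):
    # stand-in for learned saliency features: normalised image intensity
    return image / (image.max() + 1e-8)

def refine_stage(image, prev_map, state):
    """One refinement stage: takes the original image and the previous
    stage's saliency map, and carries a recurrent state between stages."""
    evidence = extract_evidence(image)
    new_map = np.clip(0.5 * prev_map + 0.5 * evidence, 0.0, 1.0)
    state = np.tanh(state + new_map)      # recurrent coupling across stages
    return new_map, state

def multistage_predict(image, n_stages=3):
    saliency = np.zeros_like(image)       # coarse initial estimate
    state = np.zeros_like(image)
    for _ in range(n_stages):
        # each stage sees the original image and the previous output
        saliency, state = refine_stage(image, saliency, state)
    return saliency

pano = np.abs(np.random.default_rng(0).standard_normal((8, 16)))  # toy ODI
sal = multistage_predict(pano)
```

Each pass blends the previous estimate toward the image-derived evidence, so later stages produce progressively more confident maps; in MRGAN360 a per-stage discriminator supplies the training signal instead of this fixed blend.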
arXiv Detail & Related papers (2023-03-15T11:15:03Z)
- Active Gaze Control for Foveal Scene Exploration [124.11737060344052]
We propose a methodology to emulate how humans and robots with foveal cameras would explore a scene.
The proposed method achieves an increase in detection F1-score of 2-3 percentage points for the same number of gaze shifts.
arXiv Detail & Related papers (2022-08-24T14:59:28Z)
- GazeOnce: Real-Time Multi-Person Gaze Estimation [18.16091280655655]
Appearance-based gaze estimation aims to predict the 3D eye gaze direction from a single image.
Recent deep learning-based approaches have demonstrated excellent performance, but cannot output multi-person gaze in real time.
We propose GazeOnce, which is capable of simultaneously predicting gaze directions for multiple faces in an image.
arXiv Detail & Related papers (2022-04-20T14:21:47Z)
- Non-Homogeneous Haze Removal via Artificial Scene Prior and Bidimensional Graph Reasoning [52.07698484363237]
We propose a Non-Homogeneous Haze Removal Network (NHRN) via artificial scene prior and bidimensional graph reasoning.
Our method achieves superior performance over many state-of-the-art algorithms for both the single image dehazing and hazy image understanding tasks.
arXiv Detail & Related papers (2021-04-05T13:04:44Z)
- 360-Degree Gaze Estimation in the Wild Using Multiple Zoom Scales [26.36068336169795]
We develop a model that mimics humans' ability to estimate the gaze by aggregating from focused looks.
The model avoids the need to extract clear eye patches.
We extend the model to handle the challenging task of 360-degree gaze estimation.
arXiv Detail & Related papers (2020-09-15T08:45:12Z)
- Appearance Consensus Driven Self-Supervised Human Mesh Recovery [67.20942777949793]
We present a self-supervised human mesh recovery framework to infer human pose and shape from monocular images.
We achieve state-of-the-art results on the standard model-based 3D pose estimation benchmarks.
The resulting colored mesh prediction opens up the usage of our framework for a variety of appearance-related tasks beyond the pose and shape estimation.
arXiv Detail & Related papers (2020-08-04T05:40:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.