Modeling Human Gaze Behavior with Diffusion Models for Unified Scanpath Prediction
- URL: http://arxiv.org/abs/2507.23021v1
- Date: Wed, 30 Jul 2025 18:36:09 GMT
- Title: Modeling Human Gaze Behavior with Diffusion Models for Unified Scanpath Prediction
- Authors: Giuseppe Cartella, Vittorio Cuculo, Alessandro D'Amelio, Marcella Cornia, Giuseppe Boccignone, Rita Cucchiara
- Abstract summary: We present ScanDiff, a novel architecture that combines diffusion models with Vision Transformers to generate diverse and realistic scanpaths. Our method explicitly models scanpath variability by leveraging the stochastic nature of diffusion models, producing a wide range of plausible gaze trajectories. Experiments on benchmark datasets show that ScanDiff surpasses state-of-the-art methods in both free-viewing and task-driven scenarios.
- Score: 66.71402249062777
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Predicting human gaze scanpaths is crucial for understanding visual attention, with applications in human-computer interaction, autonomous systems, and cognitive robotics. While deep learning models have advanced scanpath prediction, most existing approaches generate averaged behaviors, failing to capture the variability of human visual exploration. In this work, we present ScanDiff, a novel architecture that combines diffusion models with Vision Transformers to generate diverse and realistic scanpaths. Our method explicitly models scanpath variability by leveraging the stochastic nature of diffusion models, producing a wide range of plausible gaze trajectories. Additionally, we introduce textual conditioning to enable task-driven scanpath generation, allowing the model to adapt to different visual search objectives. Experiments on benchmark datasets show that ScanDiff surpasses state-of-the-art methods in both free-viewing and task-driven scenarios, producing more diverse and accurate scanpaths. These results highlight its ability to better capture the complexity of human visual behavior, pushing forward gaze prediction research. Source code and models are publicly available at https://aimagelab.github.io/ScanDiff.
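To make the mechanism concrete, below is a minimal sketch of how conditional diffusion sampling over fixation sequences could look in PyTorch. All class names, tensor shapes, the 512-dimensional condition vector, and the noise schedule are illustrative assumptions, not the released ScanDiff implementation (see the project page above for the actual code). The key property it illustrates is the one the abstract emphasizes: because sampling starts from random noise, repeated calls on the same stimulus yield different plausible scanpaths rather than one averaged trajectory.

```python
# Minimal sketch of diffusion-based scanpath sampling (illustrative only;
# not the released ScanDiff code). A denoiser predicts the noise added to a
# sequence of fixations (x, y, duration), conditioned on image/text features.
import torch
import torch.nn as nn

class ScanpathDenoiser(nn.Module):
    """Hypothetical eps-prediction network: a Transformer over noisy fixations."""
    def __init__(self, dim=128):
        super().__init__()
        self.in_proj = nn.Linear(3, dim)           # (x, y, duration) -> dim
        self.t_embed = nn.Embedding(1000, dim)     # diffusion timestep embedding
        self.cond_proj = nn.Linear(512, dim)       # image + text condition vector
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.out_proj = nn.Linear(dim, 3)

    def forward(self, x_t, t, cond):
        h = (self.in_proj(x_t)
             + self.t_embed(t)[:, None, :]
             + self.cond_proj(cond)[:, None, :])
        return self.out_proj(self.encoder(h))      # predicted noise, same shape as x_t

@torch.no_grad()
def sample_scanpath(model, cond, seq_len=16, steps=1000):
    """DDPM ancestral sampling: start from noise, denoise step by step.
    Each call draws a *different* plausible scanpath for the same stimulus."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(cond.size(0), seq_len, 3)      # pure noise over fixations
    for t in reversed(range(steps)):
        t_batch = torch.full((cond.size(0),), t, dtype=torch.long)
        eps = model(x, t_batch, cond)
        mean = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        x = mean + betas[t].sqrt() * torch.randn_like(x) if t > 0 else mean
    return x  # normalized (x, y, duration) per fixation

model = ScanpathDenoiser()
cond = torch.randn(2, 512)            # e.g., pooled ViT image feature + text embedding
paths = sample_scanpath(model, cond)  # (2, 16, 3); resample for a new trajectory
```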
Related papers
- Human Scanpath Prediction in Target-Present Visual Search with Semantic-Foveal Bayesian Attention [49.99728312519117]
SemBA-FAST is a top-down framework designed for predicting human visual attention in target-present visual search.
We evaluate SemBA-FAST on the COCO-Search18 benchmark dataset, comparing its performance against other scanpath prediction models.
These findings provide valuable insights into the capabilities of semantic-foveal probabilistic frameworks for human-like attention modelling.
arXiv Detail & Related papers (2025-07-24T15:19:23Z) - SliderSpace: Decomposing the Visual Capabilities of Diffusion Models [50.82362500995365]
SliderSpace is a framework for automatically decomposing the visual capabilities of diffusion models.
It discovers multiple interpretable and diverse directions simultaneously from a single text prompt.
Our method produces more diverse and useful variations compared to baselines.
arXiv Detail & Related papers (2025-02-03T18:59:55Z) - Unified Dynamic Scanpath Predictors Outperform Individually Trained Neural Models [18.327960366321655]
We develop a deep learning-based social cue integration model for saliency prediction to predict scanpaths in videos.
We evaluate our approach on gaze data from dynamic social scenes, observed under the free-viewing condition.
Results indicate that a single unified model, trained on all the observers' scanpaths, performs on par with or better than individually trained models.
arXiv Detail & Related papers (2024-05-05T13:15:11Z) - EyeFormer: Predicting Personalized Scanpaths with Transformer-Guided Reinforcement Learning [31.583764158565916]
We present EyeFormer, a machine learning model for predicting scanpaths in a visual user interface.
Our model has the unique capability of producing personalized predictions when given a few user scanpath samples.
It can predict full scanpath information, including fixation positions and durations, across individuals and various stimulus types.
arXiv Detail & Related papers (2024-04-15T22:26:27Z) - ScanDL: A Diffusion Model for Generating Synthetic Scanpaths on Texts [0.5520145204626482]
Eye movements in reading play a crucial role in psycholinguistic research.
The scarcity of eye movement data and its unavailability at application time poses a major challenge for this line of research.
We propose ScanDL, a novel discrete sequence-to-sequence diffusion model that generates synthetic scanpaths on texts.
arXiv Detail & Related papers (2023-10-24T07:52:19Z) - Visual Affordance Prediction for Guiding Robot Exploration [56.17795036091848]
We develop an approach for learning visual affordances for guiding robot exploration.
We use a Transformer-based model to learn a conditional distribution in the latent embedding space of a VQ-VAE.
We show how the trained affordance model can be used to guide exploration by acting as a goal-sampling distribution during visual goal-conditioned policy learning in robotic manipulation.
arXiv Detail & Related papers (2023-05-28T17:53:09Z) - Scanpath Prediction in Panoramic Videos via Expected Code Length Minimization [27.06179638588126]
We present a new criterion for scanpath prediction based on principles from lossy data compression.
This criterion suggests minimizing the expected code length of quantized scanpaths in a training set.
We also introduce a proportional-integral-derivative (PID) controller-based sampler to generate realistic human-like scanpaths (a generic PID sketch appears after this list).
arXiv Detail & Related papers (2023-05-04T04:10:47Z) - An Inter-observer consistent deep adversarial training for visual scanpath prediction [66.46953851227454]
We propose an inter-observer consistent adversarial training approach for scanpath prediction through a lightweight deep neural network.
We demonstrate that our approach is competitive with state-of-the-art methods.
arXiv Detail & Related papers (2022-11-14T13:22:29Z) - A Simple and efficient deep Scanpath Prediction [6.294759639481189]
We explore the efficiency of common deep learning architectures used in a simple, fully convolutional, regressive manner.
We test how well these models can predict scanpaths on two datasets.
We also compare the leveraged backbone architectures based on their performance in these experiments, to determine which are most suitable for the task.
arXiv Detail & Related papers (2021-12-08T22:43:45Z) - Model-Based Visual Planning with Self-Supervised Functional Distances [104.83979811803466]
We present a self-supervised method for model-based visual goal reaching.
Our approach learns entirely using offline, unlabeled data.
We find that this approach substantially outperforms both model-free and model-based prior methods.
arXiv Detail & Related papers (2020-12-30T23:59:09Z)
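As referenced in the panoramic-video entry above, here is a minimal generic PID controller sketch in Python. The gains, the one-dimensional gaze state, and the fixation target are illustrative assumptions; the paper's actual sampler operates on panoramic scanpaths and is not reproduced here. The sketch only shows the standard control law the entry names: a signal proportional to the error, its integral, and its derivative, which steers the current gaze point smoothly toward a target.

```python
# Generic PID controller sketch, assuming the role described above: steer the
# current gaze point toward a target fixation to trace a human-like trajectory.
# Gains and the 1D state are illustrative assumptions, not the paper's values.
class PID:
    def __init__(self, kp=0.8, ki=0.05, kd=0.2):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, target, current, dt=1.0):
        """Return a control signal moving `current` toward `target`."""
        error = target - current
        self.integral += error * dt
        deriv = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * deriv

# Trace one gaze coordinate toward a fixation target at 0.7 (normalized units).
pid, gaze = PID(), 0.0
trajectory = []
for _ in range(20):
    gaze += pid.step(target=0.7, current=gaze)
    trajectory.append(round(gaze, 3))
print(trajectory)  # converges toward 0.7 with PID-controlled dynamics
```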