EyeFormer: Predicting Personalized Scanpaths with Transformer-Guided Reinforcement Learning
- URL: http://arxiv.org/abs/2404.10163v2
- Date: Sun, 21 Apr 2024 03:17:23 GMT
- Title: EyeFormer: Predicting Personalized Scanpaths with Transformer-Guided Reinforcement Learning
- Authors: Yue Jiang, Zixin Guo, Hamed Rezazadegan Tavakoli, Luis A. Leiva, Antti Oulasvirta
- Abstract summary: We present EyeFormer, a machine learning model for predicting scanpaths in a visual user interface.
Our model has the unique capability of producing personalized predictions when given a few user scanpath samples.
It can predict full scanpath information, including fixation positions and durations, across individuals and various stimulus types.
- Score: 31.583764158565916
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: From a visual perception perspective, modern graphical user interfaces (GUIs) comprise a complex graphics-rich two-dimensional visuospatial arrangement of text, images, and interactive objects such as buttons and menus. While existing models can accurately predict regions and objects that are likely to attract attention "on average", so far there is no scanpath model capable of predicting scanpaths for an individual. To close this gap, we introduce EyeFormer, which leverages a Transformer architecture as a policy network to guide a deep reinforcement learning algorithm that controls gaze locations. Our model has the unique capability of producing personalized predictions when given a few user scanpath samples. It can predict full scanpath information, including fixation positions and durations, across individuals and various stimulus types. Additionally, we demonstrate applications in GUI layout optimization driven by our model. Our software and models will be publicly available.
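The abstract gives only the high-level design. As a rough illustration of a Transformer policy network that autoregressively samples fixations and is trained with a policy-gradient update, consider the sketch below; the module sizes, the Gaussian fixation head, the REINFORCE objective, and the placeholder reward are all assumptions, not the paper's implementation.

```python
# Minimal sketch of a Transformer policy for scanpath generation, trained
# with REINFORCE. Sizes, the Gaussian fixation head, and the reward are
# illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class ScanpathPolicy(nn.Module):
    def __init__(self, d_model=128, n_steps=8):
        super().__init__()
        self.n_steps = n_steps
        self.start = nn.Parameter(torch.zeros(1, 1, d_model))   # learned start token
        self.embed = nn.Linear(3, d_model)                      # (x, y, duration) -> token
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 6)                       # mean and log-std per dim

    def forward(self, batch_size):
        """Autoregressively sample fixations and their log-probabilities."""
        tokens = self.start.expand(batch_size, 1, -1)
        fixations, log_probs = [], []
        for _ in range(self.n_steps):
            h = self.body(tokens)[:, -1]                        # state of latest token
            mean, log_std = self.head(h).chunk(2, dim=-1)
            dist = torch.distributions.Normal(mean, log_std.exp())
            fix = dist.sample()                                 # one (x, y, duration)
            fixations.append(fix)
            log_probs.append(dist.log_prob(fix).sum(-1))
            tokens = torch.cat([tokens, self.embed(fix).unsqueeze(1)], dim=1)
        return torch.stack(fixations, 1), torch.stack(log_probs, 1)

policy = ScanpathPolicy()
scanpath, logp = policy(batch_size=4)    # (4, 8, 3) fixations, (4, 8) log-probs
reward = torch.rand(4, 1)                # stand-in; e.g. similarity to a user's scanpaths
loss = -(logp * reward).mean()           # REINFORCE policy-gradient objective
loss.backward()
```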
Related papers
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- Social-Transmotion: Promptable Human Trajectory Prediction [65.80068316170613]
Social-Transmotion is a generic Transformer-based model that exploits diverse and numerous visual cues to predict human behavior.
Our approach is validated on multiple datasets, including JTA, JRDB, Pedestrians and Cyclists in Road Traffic, and ETH-UCY.
arXiv Detail & Related papers (2023-12-26T18:56:49Z)
- Visual Affordance Prediction for Guiding Robot Exploration [56.17795036091848]
We develop an approach for learning visual affordances for guiding robot exploration.
We use a Transformer-based model to learn a conditional distribution in the latent embedding space of a VQ-VAE.
We show how the trained affordance model can be used for guiding exploration by acting as a goal-sampling distribution, during visual goal-conditioned policy learning in robotic manipulation.
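Modelling a conditional distribution over a VQ-VAE's discrete codes with a Transformer is a common pattern; below is a minimal sketch of what such a code prior could look like. The codebook size, sequence length, and conditioning interface are assumptions, and the VQ-VAE decoder is stubbed out.

```python
# Sketch: a Transformer prior over discrete VQ-VAE codes, used to sample
# plausible affordance latents. The VQ-VAE itself is stubbed out; all
# sizes are illustrative assumptions.
import torch
import torch.nn as nn

V, T, D = 512, 16, 128          # codebook size, code-sequence length, width

class CodePrior(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(V + 1, D)     # +1 for a BOS/context token
        layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(D, V)

    @torch.no_grad()
    def sample(self, batch=1):
        seq = torch.full((batch, 1), V)       # start from the BOS token
        for _ in range(T):
            logits = self.out(self.body(self.tok(seq))[:, -1])
            nxt = torch.multinomial(logits.softmax(-1), 1)
            seq = torch.cat([seq, nxt], dim=1)
        return seq[:, 1:]                     # sampled code indices

codes = CodePrior().sample(batch=2)           # (2, 16) indices into the codebook
# In the full pipeline these would be decoded by the VQ-VAE decoder and
# used as sampled goals during exploration.
```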
arXiv Detail & Related papers (2023-05-28T17:53:09Z)
- Scanpath Prediction in Panoramic Videos via Expected Code Length Minimization [27.06179638588126]
We present a new criterion for scanpath prediction based on principles from lossy data compression.
This criterion suggests minimizing the expected code length of quantized scanpaths in a training set.
We also introduce a proportional-integral-derivative (PID) controller-based sampler to generate realistic human-like scanpaths.
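A PID controller in this role plausibly steers a simulated gaze point toward successive target fixations. A generic, paper-agnostic sketch of such a sampler, with illustrative gains:

```python
# Generic PID update for steering a simulated gaze point toward a target
# fixation. Gains and the way targets are chosen are illustrative; the
# paper's sampler may differ in detail.
import numpy as np

def pid_scanpath(targets, kp=0.8, ki=0.05, kd=0.2, steps_per_target=10):
    pos = np.zeros(2)                      # start at image origin (normalized)
    integral = np.zeros(2)
    prev_err = np.zeros(2)
    path = [pos.copy()]
    for target in targets:                 # one control loop per fixation target
        for _ in range(steps_per_target):
            err = target - pos
            integral += err
            deriv = err - prev_err
            pos = pos + kp * err + ki * integral + kd * deriv
            prev_err = err
            path.append(pos.copy())
    return np.array(path)

# Example: drive the gaze through three target fixations in [0, 1]^2.
trajectory = pid_scanpath(np.array([[0.2, 0.8], [0.7, 0.6], [0.5, 0.1]]))
```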
arXiv Detail & Related papers (2023-05-04T04:10:47Z)
- Interactive Visual Feature Search [8.255656003475268]
We introduce Visual Feature Search, a novel interactive visualization that is adaptable to any CNN.
Our tool allows a user to highlight an image region and search for images from a given dataset with the most similar model features.
We demonstrate how our tool elucidates different aspects of model behavior by performing experiments on a range of applications, such as in medical imaging and wildlife classification.
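Under the hood, such a search reduces to nearest-neighbour lookup in a CNN feature space. A minimal sketch using a torchvision backbone as a stand-in extractor (the tool's actual model hookup is an assumption, and the highlighted region is assumed pre-cropped and resized):

```python
# Sketch: find dataset images whose CNN features best match a highlighted
# region. Uses a torchvision ResNet as a stand-in feature extractor.
import torch
from torchvision.models import resnet18

backbone = resnet18(weights=None)
backbone.fc = torch.nn.Identity()          # expose the 512-d pooled features
backbone.eval()

@torch.no_grad()
def features(x):                           # x: (N, 3, H, W) image batch
    f = backbone(x)
    return torch.nn.functional.normalize(f, dim=1)

# Highlighted region (already cropped/resized) vs. a small candidate set.
region = torch.rand(1, 3, 224, 224)
dataset = torch.rand(16, 3, 224, 224)
scores = features(dataset) @ features(region).T    # cosine similarities
top = scores.squeeze(1).topk(5).indices            # indices of best matches
```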
arXiv Detail & Related papers (2022-11-28T04:39:03Z)
- Conditioned Human Trajectory Prediction using Iterative Attention Blocks [70.36888514074022]
We present a simple yet effective pedestrian trajectory prediction model aimed at predicting pedestrian positions in urban-like environments.
Our model is a neural-based architecture that can run several layers of attention blocks and transformers in an iterative sequential fashion.
We show that without explicit introduction of social masks, dynamical models, social pooling layers, or complicated graph-like structures, it is possible to produce results on par with SoTA models.
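As a rough sketch of what "iterative sequential" attention could mean, the same attention block can be applied repeatedly to refine per-pedestrian embeddings; the dimensions and iteration count below are assumptions:

```python
# Sketch of iteratively applied attention blocks over pedestrian states:
# the same block refines trajectory embeddings several times.
import torch
import torch.nn as nn

class IterativeAttention(nn.Module):
    def __init__(self, dim=64, n_iters=3):
        super().__init__()
        self.n_iters = n_iters
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):                      # x: (batch, pedestrians, dim)
        for _ in range(self.n_iters):          # same block, applied iteratively
            a, _ = self.attn(x, x, x)          # pedestrians attend to each other
            x = x + a                          # residual update
            x = x + self.ff(x)
        return x

emb = torch.rand(2, 5, 64)                     # 2 scenes, 5 pedestrians each
refined = IterativeAttention()(emb)            # feed to a positions head downstream
```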
arXiv Detail & Related papers (2022-06-29T07:49:48Z)
- A Graph-Enhanced Click Model for Web Search [67.27218481132185]
We propose a novel graph-enhanced click model (GraphCM) for web search.
We exploit both intra-session and inter-session information to address the sparsity and cold-start problems.
arXiv Detail & Related papers (2022-06-17T08:32:43Z)
- A Simple and efficient deep Scanpath Prediction [6.294759639481189]
We explore the efficiency of using common deep learning architectures in a simple, fully convolutional regression setting.
We examine how well these models can predict scanpaths on two datasets.
We also compare the leveraged backbone architectures based on their performance in these experiments, to determine which are most suitable for the task.
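A "simple fully convolutional" regressor plausibly maps an image straight to a fixed number of fixation coordinates. A minimal sketch under that reading (layer sizes and fixation count are assumptions; the paper instead evaluates standard pretrained backbones):

```python
# Sketch of a fully convolutional scanpath regressor: a conv stack whose
# pooled features directly regress N fixation coordinates.
import torch
import torch.nn as nn

class ConvScanpathRegressor(nn.Module):
    def __init__(self, n_fixations=10):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Conv2d(64, n_fixations * 2, 1)   # 1x1 conv keeps it fully conv

    def forward(self, x):
        out = self.head(self.backbone(x))                # (B, 2N, 1, 1)
        return out.flatten(1).view(x.size(0), -1, 2)     # (B, N, 2) fixations

model = ConvScanpathRegressor()
pred = model(torch.rand(2, 3, 224, 224))                 # two images -> two scanpaths
loss = nn.functional.mse_loss(pred, torch.rand_like(pred))   # regression objective
```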
arXiv Detail & Related papers (2021-12-08T22:43:45Z)
- Scanpath Prediction on Information Visualisations [19.591855190022667]
We propose a model that learns to predict visual saliency and scanpaths on information visualisations.
We present in-depth analyses of gaze behaviour for different information visualisation elements on the popular MASSVIS dataset.
arXiv Detail & Related papers (2021-12-04T13:59:52Z)
- A Variational Graph Autoencoder for Manipulation Action Recognition and Prediction [1.1816942730023883]
We introduce a deep graph autoencoder to jointly learn recognition and prediction of manipulation tasks from symbolic scene graphs.
Our network has a variational autoencoder structure with two branches: one for identifying the input graph type and one for predicting the future graphs.
We benchmark our new model against different state-of-the-art methods on two different datasets, MANIAC and MSRC-9, and show that our proposed model can achieve better performance.
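A hedged sketch of the described two-branch variational design, with message passing reduced to a single dense adjacency multiply (all dimensions, and the pooling for the classification branch, are assumptions):

```python
# Sketch of a two-branch variational graph autoencoder: a shared encoder
# over node features, one head classifying the input graph and one
# decoding future node states. Sizes are illustrative.
import torch
import torch.nn as nn

class TwoBranchVGAE(nn.Module):
    def __init__(self, in_dim=8, hid=32, n_classes=5):
        super().__init__()
        self.enc = nn.Linear(in_dim, 2 * hid)     # outputs mean and log-var
        self.cls_head = nn.Linear(hid, n_classes) # branch 1: graph type
        self.dec_head = nn.Linear(hid, in_dim)    # branch 2: future node states

    def forward(self, x, adj):
        h = adj @ x                                # one dense message-passing step
        mean, log_var = self.enc(h).chunk(2, dim=-1)
        z = mean + torch.randn_like(mean) * (0.5 * log_var).exp()  # reparameterize
        graph_logits = self.cls_head(z.mean(0))    # pool nodes -> graph label
        future = self.dec_head(z)                  # per-node prediction
        kl = -0.5 * (1 + log_var - mean**2 - log_var.exp()).mean()
        return graph_logits, future, kl

x, adj = torch.rand(6, 8), torch.eye(6)            # 6 nodes, identity adjacency
logits, future, kl = TwoBranchVGAE()(x, adj)
```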
arXiv Detail & Related papers (2021-10-25T21:40:42Z)
- Learning to Generate Scene Graph from Natural Language Supervision [52.18175340725455]
We propose one of the first methods that learn from image-sentence pairs to extract a graphical representation of localized objects and their relationships within an image, known as a scene graph.
We leverage an off-the-shelf object detector to identify and localize object instances, match labels of detected regions to concepts parsed from captions, and thus create "pseudo" labels for learning scene graph.
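The pseudo-label step amounts to aligning detector outputs with concepts parsed from captions. A toy sketch of that alignment, with the detector and caption parser replaced by hand-written stand-ins:

```python
# Toy sketch of the pseudo-label step: align detected region labels with
# concepts parsed from a caption to get (subject, predicate, object)
# labels. The real pipeline uses an object detector and a caption parser;
# both are replaced with hand-written stand-ins here.
detections = [("dog", (10, 20, 80, 90)), ("frisbee", (60, 15, 110, 50))]
caption_triples = [("dog", "catching", "frisbee")]     # parsed from a caption

def pseudo_labels(detections, triples):
    boxes = {label: box for label, box in detections}
    labels = []
    for subj, pred, obj in triples:
        if subj in boxes and obj in boxes:              # both concepts grounded
            labels.append((boxes[subj], pred, boxes[obj]))
    return labels

print(pseudo_labels(detections, caption_triples))
# [((10, 20, 80, 90), 'catching', (60, 15, 110, 50))]
```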
arXiv Detail & Related papers (2021-09-06T03:38:52Z)