Related papers: Human Scanpath Prediction in Target-Present Visual Search with Semantic-Foveal Bayesian Attention

Human Scanpath Prediction in Target-Present Visual Search with Semantic-Foveal Bayesian Attention

URL: http://arxiv.org/abs/2507.18503v1
Date: Thu, 24 Jul 2025 15:19:23 GMT
Title: Human Scanpath Prediction in Target-Present Visual Search with Semantic-Foveal Bayesian Attention
Authors: João Luzio, Alexandre Bernardino, Plinio Moreno,
Abstract summary: SemBA-FAST is a top-down framework designed for predicting human visual attention in target-present visual search.<n>We evaluate SemBA-FAST on the COCO-Search18 benchmark dataset, comparing its performance against other scanpath prediction models.<n>These findings provide valuable insights into the capabilities of semantic-foveal probabilistic frameworks for human-like attention modelling.
Score: 49.99728312519117
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In goal-directed visual tasks, human perception is guided by both top-down and bottom-up cues. At the same time, foveal vision plays a crucial role in directing attention efficiently. Modern research on bio-inspired computational attention models has taken advantage of advancements in deep learning by utilizing human scanpath data to achieve new state-of-the-art performance. In this work, we assess the performance of SemBA-FAST, i.e. Semantic-based Bayesian Attention for Foveal Active visual Search Tasks, a top-down framework designed for predicting human visual attention in target-present visual search. SemBA-FAST integrates deep object detection with a probabilistic semantic fusion mechanism to generate attention maps dynamically, leveraging pre-trained detectors and artificial foveation to update top-down knowledge and improve fixation prediction sequentially. We evaluate SemBA-FAST on the COCO-Search18 benchmark dataset, comparing its performance against other scanpath prediction models. Our methodology achieves fixation sequences that closely match human ground-truth scanpaths. Notably, it surpasses baseline and other top-down approaches and competes, in some cases, with scanpath-informed models. These findings provide valuable insights into the capabilities of semantic-foveal probabilistic frameworks for human-like attention modelling, with implications for real-time cognitive computing and robotics.

Related papers

Semantic-Based Active Perception for Humanoid Visual Tasks with Foveal Sensors [49.99728312519117]
The aim of this work is to establish how accurately a recent semantic-based active perception model is able to complete visual tasks that are regularly performed by humans. This model exploits the ability of current object detectors to localize and classify a large number of object classes and to update a semantic description of a scene across multiple fixations. In the task of scene exploration, the semantic-based method demonstrates superior performance compared to the traditional saliency-based model.
arXiv Detail & Related papers (2024-04-16T18:15:57Z)
Contrastive Language-Image Pretrained Models are Zero-Shot Human Scanpath Predictors [2.524526956420465]
CapMIT1003 is a database of captions and click-contingent image explorations collected during captioning tasks. NevaClip is a novel zero-shot method for predicting visual scanpaths.
arXiv Detail & Related papers (2023-05-21T07:24:50Z)
Top-Down Visual Attention from Analysis by Synthesis [87.47527557366593]
We consider top-down attention from a classic Analysis-by-Synthesis (AbS) perspective of vision. We propose Analysis-by-Synthesis Vision Transformer (AbSViT), which is a top-down modulated ViT model that variationally approximates AbS, and controllable achieves top-down attention.
arXiv Detail & Related papers (2023-03-23T05:17:05Z)
Unifying Top-down and Bottom-up Scanpath Prediction Using Transformers [40.27531644565077]
We propose the Human Attention Transformer (HAT), a single model that predicts both forms of attention control. HAT sets a new standard in computational attention, which emphasizes effectiveness, generality, and interpretability.
arXiv Detail & Related papers (2023-03-16T15:13:09Z)
An Inter-observer consistent deep adversarial training for visual scanpath prediction [66.46953851227454]
We propose an inter-observer consistent adversarial training approach for scanpath prediction through a lightweight deep neural network. We show the competitiveness of our approach in regard to state-of-the-art methods.
arXiv Detail & Related papers (2022-11-14T13:22:29Z)
Knowledge Guided Bidirectional Attention Network for Human-Object Interaction Detection [3.0915392100355192]
We argue that the independent use of the bottom-up parsing strategy in HOI is counter-intuitive and could lead to the diffusion of attention. We introduce a novel knowledge-guided top-down attention into HOI, and propose to model the relation parsing as a "look and search" process. We implement the process via unifying the bottom-up and top-down attention in a single encoder-decoder based model.
arXiv Detail & Related papers (2022-07-16T16:42:49Z)
Behind the Machine's Gaze: Biologically Constrained Neural Networks Exhibit Human-like Visual Attention [40.878963450471026]
We propose the Neural Visual Attention (NeVA) algorithm to generate visual scanpaths in a top-down manner. We show that the proposed method outperforms state-of-the-art unsupervised human attention models in terms of similarity to human scanpaths.
arXiv Detail & Related papers (2022-04-19T18:57:47Z)
SALYPATH: A Deep-Based Architecture for visual attention prediction [5.068678962285629]
Visual attention is useful for many computer vision applications such as image compression, recognition, and captioning. We propose an end-to-end deep-based method, so-called SALYPATH, that efficiently predicts the scanpath of an image through features of a saliency model. The idea is predict the scanpath by exploiting the capacity of a deep-based model to predict the saliency.
arXiv Detail & Related papers (2021-06-29T08:53:51Z)
TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild [77.59069361196404]
TRiPOD is a novel method for predicting body dynamics based on graph attentional networks. To incorporate a real-world challenge, we learn an indicator representing whether an estimated body joint is visible/invisible at each frame. Our evaluation shows that TRiPOD outperforms all prior work and state-of-the-art specifically designed for each of the trajectory and pose forecasting tasks.
arXiv Detail & Related papers (2021-04-08T20:01:00Z)
Variational Structured Attention Networks for Deep Visual Representation Learning [49.80498066480928]
We propose a unified deep framework to jointly learn both spatial attention maps and channel attention in a principled manner. Specifically, we integrate the estimation and the interaction of the attentions within a probabilistic representation learning framework. We implement the inference rules within the neural network, thus allowing for end-to-end learning of the probabilistic and the CNN front-end parameters.
arXiv Detail & Related papers (2021-03-05T07:37:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.