OAT: Object-Level Attention Transformer for Gaze Scanpath Prediction
- URL: http://arxiv.org/abs/2407.13335v1
- Date: Thu, 18 Jul 2024 09:33:17 GMT
- Title: OAT: Object-Level Attention Transformer for Gaze Scanpath Prediction
- Authors: Yini Fang, Jingling Yu, Haozheng Zhang, Ralf van der Lans, Bertram Shi
- Abstract summary: This paper introduces the Object-level Attention Transformer (OAT), which predicts human scanpaths as observers search for a target object within a cluttered scene of distractors.
We evaluate OAT on the Amazon book cover dataset and a new dataset for visual search that we collected.
- Score: 0.2796197251957245
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual search is important in our daily life. The efficient allocation of visual attention is critical to effectively complete visual search tasks. Prior research has predominantly modelled the spatial allocation of visual attention in images at the pixel level, e.g. using a saliency map. However, emerging evidence shows that visual attention is guided by objects rather than pixel intensities. This paper introduces the Object-level Attention Transformer (OAT), which predicts human scanpaths as they search for a target object within a cluttered scene of distractors. OAT uses an encoder-decoder architecture. The encoder captures information about the position and appearance of the objects within an image and about the target. The decoder predicts the gaze scanpath as a sequence of object fixations, by integrating output features from both the encoder and decoder. We also propose a new positional encoding that better reflects spatial relationships between objects. We evaluated OAT on the Amazon book cover dataset and a new dataset for visual search that we collected. OAT's predicted gaze scanpaths align more closely with human gaze patterns, compared to predictions by algorithms based on spatial attention on both established metrics and a novel behavioural-based metric. Our results demonstrate the generalization ability of OAT, as it accurately predicts human scanpaths for unseen layouts and target objects.
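As a concrete illustration of the pipeline described in the abstract, below is a minimal sketch of an object-level encoder-decoder scanpath predictor. It is not the authors' implementation: it assumes PyTorch, precomputed per-object appearance features from some visual backbone, and a simple learned encoding of normalised object centres as a stand-in for OAT's proposed positional encoding; all module names, dimensions, and the pointer-style decoding are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of an object-level encoder-decoder
# scanpath predictor in the spirit of OAT. Appearance features are assumed to
# be precomputed by a visual backbone; the positional encoding below is a
# simple stand-in for the paper's object-level encoding.
import torch
import torch.nn as nn


class ObjectScanpathSketch(nn.Module):
    def __init__(self, feat_dim=512, d_model=256, nhead=8, num_layers=4):
        super().__init__()
        # Project per-object (and target) appearance features into model space.
        self.appearance_proj = nn.Linear(feat_dim, d_model)
        # Assumed positional encoding: an MLP over normalised (x, y) object
        # centres, encoding position at the object level rather than per pixel.
        self.pos_encoder = nn.Sequential(
            nn.Linear(2, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.start_token = nn.Parameter(torch.zeros(1, 1, d_model))

    def forward(self, obj_feats, obj_centers, target_feat, fixation_ids):
        """
        obj_feats:    (B, N, feat_dim) appearance feature per candidate object
        obj_centers:  (B, N, 2)        normalised (x, y) centre per object
        target_feat:  (B, feat_dim)    appearance feature of the search target
        fixation_ids: (B, T)           indices of previously fixated objects
        returns:      (B, T + 1, N)    logits over objects for the next fixation
        """
        obj_tokens = self.appearance_proj(obj_feats) + self.pos_encoder(obj_centers)
        target_token = self.appearance_proj(target_feat).unsqueeze(1)
        # The encoder sees all candidate objects plus the target as an extra token.
        memory = self.encoder(torch.cat([obj_tokens, target_token], dim=1))
        obj_memory = memory[:, :-1, :]  # object tokens only, for pointing

        # Decoder input: a learned start token followed by the embeddings of
        # the objects fixated so far (teacher forcing during training).
        b, _, d = obj_tokens.shape
        history = torch.gather(
            obj_tokens, 1, fixation_ids.unsqueeze(-1).expand(-1, -1, d)
        )
        history = torch.cat([self.start_token.expand(b, -1, -1), history], dim=1)
        steps = history.size(1)
        causal = torch.triu(
            torch.full((steps, steps), float("-inf"), device=history.device), 1
        )
        dec_out = self.decoder(history, memory, tgt_mask=causal)

        # Pointer-style prediction: score every candidate object against each
        # decoder state, so the scanpath is emitted as a sequence of object
        # fixations rather than pixel locations.
        return torch.einsum("btd,bnd->btn", dec_out, obj_memory)
```

At inference time a scanpath would be produced autoregressively, for example by greedily taking the highest-scoring object at each step, appending it to fixation_ids, and stopping once the predicted fixation lands on the target or a step limit is reached.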
Related papers
- Boosting Gaze Object Prediction via Pixel-level Supervision from Vision Foundation Model [19.800353299691277]
This paper presents a more challenging gaze object segmentation (GOS) task, which involves inferring the pixel-level mask corresponding to the object captured by human gaze behavior.
We propose to automatically obtain head features from scene features to ensure the model's inference efficiency and flexibility in the real world.
arXiv Detail & Related papers (2024-08-02T06:32:45Z)
- PD-APE: A Parallel Decoding Framework with Adaptive Position Encoding for 3D Visual Grounding [20.422852022310945]
3D visual grounding aims to identify objects in 3D point cloud scenes that match specific natural language descriptions.
This requires the model to not only focus on the target object itself but also to consider the surrounding environment.
We propose PD-APE, a dual-branch decoding framework that separately decodes target object attributes and surrounding layouts.
arXiv Detail & Related papers (2024-07-19T17:44:33Z)
- Semantic-Based Active Perception for Humanoid Visual Tasks with Foveal Sensors [49.99728312519117]
The aim of this work is to establish how accurately a recent semantic-based active perception model is able to complete visual tasks that are regularly performed by humans.
This model exploits the ability of current object detectors to localize and classify a large number of object classes and to update a semantic description of a scene across multiple fixations.
In the task of scene exploration, the semantic-based method demonstrates superior performance compared to the traditional saliency-based model.
arXiv Detail & Related papers (2024-04-16T18:15:57Z)
- Selective Visual Representations Improve Convergence and Generalization for Embodied AI [44.33711781750707]
Embodied AI models often employ off-the-shelf vision backbones like CLIP to encode their visual observations.
Much of this general-purpose information is irrelevant to the task at hand, which introduces noise into the learning process and distracts the agent's focus from task-relevant visual cues.
Inspired by selective attention in humans, the process through which people filter their perception based on their experiences, knowledge, and the task at hand, we introduce a parameter-efficient approach to filter visual stimuli for embodied AI.
arXiv Detail & Related papers (2023-11-07T18:34:02Z)
- Open-Vocabulary Object Detection via Scene Graph Discovery [53.27673119360868]
Open-vocabulary (OV) object detection has attracted increasing research attention.
We propose a novel Scene-Graph-Based Discovery Network (SGDN) that exploits scene graph cues for OV detection.
arXiv Detail & Related papers (2023-07-07T00:46:19Z)
- SOOD: Towards Semi-Supervised Oriented Object Detection [57.05141794402972]
This paper proposes a novel Semi-supervised Oriented Object Detection model, termed SOOD, built upon the mainstream pseudo-labeling framework.
Our experiments show that when trained with the two proposed losses, SOOD surpasses the state-of-the-art SSOD methods under various settings on the DOTA-v1.5 benchmark.
arXiv Detail & Related papers (2023-04-10T11:10:42Z)
- Predicting Visual Attention and Distraction During Visual Search Using Convolutional Neural Networks [2.7920304852537527]
We present two approaches to model visual attention and distraction of observers during visual search.
Our first approach adapts a light-weight free-viewing saliency model to predict eye fixation density maps of human observers over pixels of search images.
Our second approach is object-based and predicts the distractor and target objects during visual search.
arXiv Detail & Related papers (2022-10-27T00:39:43Z)
- Active Gaze Control for Foveal Scene Exploration [124.11737060344052]
We propose a methodology to emulate how humans and robots with foveal cameras would explore a scene.
The proposed method achieves an increase in detection F1-score of 2-3 percentage points for the same number of gaze shifts.
arXiv Detail & Related papers (2022-08-24T14:59:28Z)
- SOON: Scenario Oriented Object Navigation with Graph-based Exploration [102.74649829684617]
The ability to navigate like a human towards a language-guided target from anywhere in a 3D embodied environment is one of the 'holy grail' goals of intelligent robots.
Most visual navigation benchmarks focus on navigating toward a target from a fixed starting point, guided by an elaborate set of step-by-step instructions.
This setting deviates from real-world problems, in which a human only describes what the object and its surroundings look like and asks the robot to start navigating from anywhere.
arXiv Detail & Related papers (2021-03-31T15:01:04Z)
- Graph Attention Tracking [76.19829750144564]
We propose a simple target-aware Siamese graph attention network for general object tracking.
Experiments on challenging benchmarks including GOT-10k, UAV123, OTB-100 and LaSOT demonstrate that the proposed SiamGAT outperforms many state-of-the-art trackers.
arXiv Detail & Related papers (2020-11-23T04:26:45Z)
- Applying r-spatiogram in object tracking for occlusion handling [16.36552899280708]
The aim of video tracking is to accurately locate a moving target in a video sequence and to discriminate the target from non-targets in the feature space of the sequence.
In this paper, we follow the basic structure shared by many trackers, which consists of three main components: object modeling, object detection and localization, and model updating.
arXiv Detail & Related papers (2020-03-18T02:42:51Z)