GazeXplain: Learning to Predict Natural Language Explanations of Visual Scanpaths
- URL: http://arxiv.org/abs/2408.02788v1
- Date: Mon, 5 Aug 2024 19:11:46 GMT
- Title: GazeXplain: Learning to Predict Natural Language Explanations of Visual Scanpaths
- Authors: Xianyu Chen, Ming Jiang, Qi Zhao
- Abstract summary: We introduce GazeXplain, a novel study of visual scanpath prediction and explanation.
This involves annotating natural-language explanations for fixations across eye-tracking datasets.
Experiments on diverse eye-tracking datasets demonstrate the effectiveness of GazeXplain in both scanpath prediction and explanation.
- Score: 20.384132849805003
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While exploring visual scenes, humans' scanpaths are driven by their underlying attention processes. Understanding visual scanpaths is essential for various applications. Traditional scanpath models predict the where and when of gaze shifts without providing explanations, creating a gap in understanding the rationale behind fixations. To bridge this gap, we introduce GazeXplain, a novel study of visual scanpath prediction and explanation. This involves annotating natural-language explanations for fixations across eye-tracking datasets and proposing a general model with an attention-language decoder that jointly predicts scanpaths and generates explanations. It integrates a unique semantic alignment mechanism to enhance the consistency between fixations and explanations, alongside a cross-dataset co-training approach for generalization. These novelties present a comprehensive and adaptable solution for explainable human visual scanpath prediction. Extensive experiments on diverse eye-tracking datasets demonstrate the effectiveness of GazeXplain in both scanpath prediction and explanation, offering valuable insights into human visual attention and cognitive processes.
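To make the described pipeline more concrete, below is a minimal, illustrative PyTorch sketch of an attention-language decoder that jointly emits fixations and explanation tokens, with a simple cosine-similarity term standing in for the semantic alignment mechanism. This is not the authors' implementation; the module names, dimensions, recurrent cells, and loss form are assumptions made only for illustration.

```python
# Toy sketch (assumed structure, not the GazeXplain code): one recurrent stream for
# fixations, one for explanation words, plus a cosine alignment term between the
# language state and the currently fixated visual feature.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointScanpathExplainer(nn.Module):
    def __init__(self, feat_dim=512, hidden=512, vocab_size=10000):
        super().__init__()
        self.fix_rnn = nn.GRUCell(feat_dim, hidden)     # attention (fixation) stream
        self.lang_rnn = nn.GRUCell(hidden, hidden)      # language (explanation) stream
        self.fix_head = nn.Linear(hidden, 3)            # predicts (x, y, duration) per fixation
        self.word_head = nn.Linear(hidden, vocab_size)  # predicts explanation-token logits
        self.align_proj = nn.Linear(hidden, feat_dim)   # projects language state for alignment

    def forward(self, fix_feats):
        # fix_feats: (B, T, feat_dim) visual features pooled around each fixation
        B, T, _ = fix_feats.shape
        h_fix = fix_feats.new_zeros(B, self.fix_rnn.hidden_size)
        h_lang = fix_feats.new_zeros(B, self.lang_rnn.hidden_size)
        fixations, word_logits, align_terms = [], [], []
        for t in range(T):
            h_fix = self.fix_rnn(fix_feats[:, t], h_fix)   # update the attention state
            h_lang = self.lang_rnn(h_fix, h_lang)          # condition language on attention
            fixations.append(self.fix_head(h_fix))
            word_logits.append(self.word_head(h_lang))
            # Semantic alignment: pull the explanation state toward the fixated visual feature.
            sim = F.cosine_similarity(self.align_proj(h_lang), fix_feats[:, t], dim=-1)
            align_terms.append(1.0 - sim)
        return (torch.stack(fixations, dim=1),            # (B, T, 3)
                torch.stack(word_logits, dim=1),          # (B, T, vocab_size)
                torch.stack(align_terms, dim=1).mean())   # scalar alignment loss

# Usage sketch:
# model = JointScanpathExplainer()
# fixations, word_logits, align_loss = model(torch.randn(2, 8, 512))
```

In this toy form, the fixation and word heads would be trained with standard regression and cross-entropy losses, and the alignment term encourages each explanation step to stay consistent with the corresponding fixation, mirroring the consistency goal stated in the abstract.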
Related papers
- Caption-Driven Explorations: Aligning Image and Text Embeddings through Human-Inspired Foveated Vision [3.3295510777293837]
We introduce CapMIT1003, a dataset with captions and click-contingent image explorations, to study human attention during the captioning task.
We also present NevaClip, a zero-shot method for predicting visual scanpaths by combining CLIP models with NeVA algorithms (an illustrative sketch of this idea appears after the related-papers list below).
arXiv Detail & Related papers (2024-08-19T12:41:46Z) - Look Hear: Gaze Prediction for Speech-directed Human Attention [49.81718760025951]
Our study focuses on the incremental prediction of attention as a person views an image and hears a referring expression.
We developed the Attention in Referral Transformer (ART) model, which predicts the human fixations spurred by each word in a referring expression.
In our quantitative and qualitative analyses, ART not only outperforms existing methods in scanpath prediction, but also appears to capture several human attention patterns.
arXiv Detail & Related papers (2024-07-28T22:35:08Z) - Beyond Average: Individualized Visual Scanpath Prediction [20.384132849805003]
Individualized scanpath prediction (ISP) aims to accurately predict how different individuals shift their attention in diverse visual tasks.
ISP features an observer encoder to characterize and integrate an observer's unique attention traits, an observer-centric feature integration approach, and an adaptive fixation prioritization mechanism.
Our method is generally applicable to different datasets, model architectures, and visual tasks, offering a comprehensive tool for transforming general scanpath models into individualized ones.
arXiv Detail & Related papers (2024-04-18T14:51:42Z) - Contrastive Language-Image Pretrained Models are Zero-Shot Human Scanpath Predictors [2.524526956420465]
CapMIT1003 is a database of captions and click-contingent image explorations collected during captioning tasks.
NevaClip is a novel zero-shot method for predicting visual scanpaths.
arXiv Detail & Related papers (2023-05-21T07:24:50Z) - Semantic Prompt for Few-Shot Image Recognition [76.68959583129335]
We propose a novel Semantic Prompt (SP) approach for few-shot learning.
The proposed approach achieves promising results, improving the 1-shot learning accuracy by 3.67% on average.
arXiv Detail & Related papers (2023-03-24T16:32:19Z) - An Inter-observer consistent deep adversarial training for visual scanpath prediction [66.46953851227454]
We propose an inter-observer consistent adversarial training approach for scanpath prediction through a lightweight deep neural network.
We show that our approach is competitive with state-of-the-art methods.
arXiv Detail & Related papers (2022-11-14T13:22:29Z) - A domain adaptive deep learning solution for scanpath prediction of paintings [66.46953851227454]
This paper focuses on the eye-movement analysis of viewers during the visual experience of a certain number of paintings.
We introduce a new approach to predicting human visual attention, a process that influences several cognitive functions in humans.
The proposed architecture ingests images and returns scanpaths: sequences of points with a high likelihood of attracting viewers' attention.
arXiv Detail & Related papers (2022-09-22T22:27:08Z) - Learnable Visual Words for Interpretable Image Recognition [70.85686267987744]
We propose the Learnable Visual Words (LVW) to interpret the model prediction behaviors with two novel modules.
The semantic visual word learning relaxes the category-specific constraint, enabling general visual words to be shared across different categories.
Our experiments on six visual benchmarks demonstrate the superior effectiveness of our proposed LVW in both accuracy and model interpretation.
arXiv Detail & Related papers (2022-05-22T03:24:45Z) - Deep Co-Attention Network for Multi-View Subspace Learning [73.3450258002607]
We propose a deep co-attention network for multi-view subspace learning.
It aims to extract both the common information and the complementary information in an adversarial setting.
In particular, it uses a novel cross reconstruction loss and leverages the label information to guide the construction of the latent representation.
arXiv Detail & Related papers (2021-02-15T18:46:44Z) - Deep semantic gaze embedding and scanpath comparison for expertise classification during OPT viewing [6.700983301090583]
We present a novel approach to gaze scanpath comparison that incorporates convolutional neural networks (CNNs).
Our approach was capable of distinguishing experts from novices with 93% accuracy while incorporating image semantics.
arXiv Detail & Related papers (2020-03-31T07:00:59Z)
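The two NevaClip entries above (Caption-Driven Explorations and Contrastive Language-Image Pretrained Models are Zero-Shot Human Scanpath Predictors) describe combining CLIP with NeVA-style foveated exploration for zero-shot scanpath prediction. Below is a rough Python sketch of that general idea rather than the published algorithm: at each step it greedily picks the grid location whose deblurred neighborhood most increases CLIP similarity between the image and its caption. The candidate grid, blur radius, fovea size, and inhibition-of-return rule are simplifying assumptions made for this sketch.

```python
# Illustrative approximation of CLIP-guided, NeVA-style zero-shot scanpath prediction.
import torch
import clip                      # OpenAI CLIP package: pip install git+https://github.com/openai/CLIP.git
from PIL import Image, ImageFilter

def predict_scanpath(img, caption, n_fixations=5, grid=5, blur=10, fovea=64, device="cpu"):
    model, preprocess = clip.load("ViT-B/32", device=device)
    with torch.no_grad():
        text = model.encode_text(clip.tokenize([caption]).to(device))
    text = text / text.norm(dim=-1, keepdim=True)
    base = img.filter(ImageFilter.GaussianBlur(blur))   # fully blurred "peripheral" view
    xs = [int(img.width * (i + 0.5) / grid) for i in range(grid)]
    ys = [int(img.height * (j + 0.5) / grid) for j in range(grid)]
    scanpath = []
    for _ in range(n_fixations):
        best, best_score, best_img = None, -float("inf"), None
        for cx in xs:
            for cy in ys:
                if (cx, cy) in scanpath:
                    continue                             # crude inhibition of return
                cand = base.copy()
                box = (max(cx - fovea, 0), max(cy - fovea, 0),
                       min(cx + fovea, img.width), min(cy + fovea, img.height))
                cand.paste(img.crop(box), box)           # sharpen the candidate fovea
                x = preprocess(cand).unsqueeze(0).to(device)
                with torch.no_grad():
                    feat = model.encode_image(x)
                feat = feat / feat.norm(dim=-1, keepdim=True)
                score = (feat @ text.T).item()           # CLIP image-caption similarity
                if score > best_score:
                    best, best_score, best_img = (cx, cy), score, cand
        scanpath.append(best)
        base = best_img                                  # keep the chosen region sharp
    return scanpath

# Example: predict_scanpath(Image.open("scene.jpg").convert("RGB"), "a dog on a couch")
```

Keeping already-fixated regions sharp makes each new fixation add information, which is the intuition behind foveated, task-driven exploration in these zero-shot approaches.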