Fixating on Attention: Integrating Human Eye Tracking into Vision
Transformers
- URL: http://arxiv.org/abs/2308.13969v1
- Date: Sat, 26 Aug 2023 22:48:06 GMT
- Title: Fixating on Attention: Integrating Human Eye Tracking into Vision
Transformers
- Authors: Sharath Koorathota, Nikolas Papadopoulos, Jia Li Ma, Shruti Kumar,
Xiaoxiao Sun, Arunesh Mittal, Patrick Adelman, Paul Sajda
- Abstract summary: This work demonstrates how human visual input, specifically fixations collected from an eye-tracking device, can be integrated into transformer models to improve accuracy across multiple driving situations and datasets.
We establish the significance of fixation regions in left-right driving decisions, as observed in both human subjects and a Vision Transformer (ViT).
We incorporate information from the driving scene with fixation data, employing a "joint space-fixation" (JSF) attention setup. Lastly, we propose a "fixation-attention intersection" (FAX) loss to train the ViT model to attend to the same regions that humans fixated on.
- Score: 5.221681407166792
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern transformer-based models designed for computer vision have
outperformed humans across a spectrum of visual tasks. However, critical tasks,
such as medical image interpretation or autonomous driving, still require
reliance on human judgments. This work demonstrates how human visual input,
specifically fixations collected from an eye-tracking device, can be integrated
into transformer models to improve accuracy across multiple driving situations
and datasets. First, we establish the significance of fixation regions in
left-right driving decisions, as observed in both human subjects and a Vision
Transformer (ViT). By comparing the similarity between human fixation maps and
ViT attention weights, we reveal the dynamics of overlap across individual
heads and layers. This overlap is exploited for model pruning without
compromising accuracy. Thereafter, we incorporate information from the driving
scene with fixation data, employing a "joint space-fixation" (JSF) attention
setup. Lastly, we propose a "fixation-attention intersection" (FAX) loss to
train the ViT model to attend to the same regions that humans fixated on. We
find that the ViT performance is improved in accuracy and number of training
epochs when using JSF and FAX. These results hold significant implications for
human-guided artificial intelligence.
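The abstract describes comparing human fixation maps with ViT attention weights per head, and training the model to attend to fixated regions. The sketch below illustrates one plausible way to measure such overlap, assuming both maps are flattened to one non-negative weight per image patch. It uses histogram intersection as the overlap measure; this is an assumption for illustration, not the paper's definition of the FAX loss. All function names (`normalize`, `intersection`, `intersection_loss`, `rank_heads`) are hypothetical.

```python
from typing import List, Sequence


def normalize(weights: Sequence[float]) -> List[float]:
    """Scale non-negative weights to sum to 1 (uniform if they sum to 0)."""
    total = sum(weights)
    if total <= 0:
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]


def intersection(attention: Sequence[float], fixation: Sequence[float]) -> float:
    """Histogram intersection of two per-patch maps; the result lies in [0, 1]."""
    if len(attention) != len(fixation):
        raise ValueError("maps must cover the same number of patches")
    a, f = normalize(attention), normalize(fixation)
    return sum(min(x, y) for x, y in zip(a, f))


def intersection_loss(attention: Sequence[float], fixation: Sequence[float]) -> float:
    """Assumed loss form: 0 when the maps coincide, 1 when they are disjoint."""
    return 1.0 - intersection(attention, fixation)


def rank_heads(head_attention: Sequence[Sequence[float]],
               fixation: Sequence[float]) -> List[int]:
    """Return head indices ordered from most to least overlap with fixations."""
    scores = [intersection(h, fixation) for h in head_attention]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy example: 4 patches, humans fixated on the last two.
fixation_map = [0.0, 0.0, 1.0, 1.0]
heads = [
    [1.0, 1.0, 0.0, 0.0],  # attends away from the fixated patches
    [0.0, 0.0, 1.0, 1.0],  # matches the fixated patches
    [1.0, 1.0, 1.0, 1.0],  # uniform attention
]
ranking = rank_heads(heads, fixation_map)
```

A ranking of this kind could, in principle, guide which heads to keep when pruning, with the lowest-overlap heads as candidates for removal. Whether that matches the procedure used in the paper is not stated in the abstract.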
Related papers
- Beyond the Doors of Perception: Vision Transformers Represent Relations Between Objects [30.09778169168547]
Vision transformers (ViTs) have achieved state-of-the-art performance in a variety of settings.
However, they exhibit surprising failures when performing tasks involving visual relations.
arXiv Detail & Related papers (2024-06-22T22:43:10Z)
- Realtime Dynamic Gaze Target Tracking and Depth-Level Estimation [6.435984242701043]
The use of Transparent Displays (TD) in applications such as Heads-Up Displays (HUDs) in vehicles is a burgeoning field, poised to revolutionize user experiences.
This innovation brings forth significant challenges in realtime human-device interaction, particularly in accurately identifying and tracking a user's gaze on dynamically changing TDs.
We present a two-fold robust and efficient systematic solution for realtime gaze monitoring, comprised of: (1) a tree-based algorithm for identifying and dynamically tracking gaze targets; and (2) a multi-stream self-attention architecture to estimate the depth-level of human gaze from eye tracking data.
arXiv Detail & Related papers (2024-06-09T20:52:47Z)
- Visualizing the loss landscape of Self-supervised Vision Transformer [53.84372035496475]
The Masked autoencoder (MAE) has drawn attention as a representative self-supervised approach for masked image modeling with vision transformers.
We visualize the loss landscapes of the self-supervised vision transformer by both MAE and RC-MAE and compare them with the supervised ViT (Sup-ViT).
To the best of our knowledge, this work is the first to investigate the self-supervised ViT through the lens of the loss landscape.
arXiv Detail & Related papers (2024-05-28T10:54:26Z)
- Simulation of a Vision Correction Display System [0.0]
This paper focuses on simulating a Vision Correction Display (VCD) to enhance the visual experience of individuals with various visual impairments.
With these simulations we can see potential improvements in visual acuity and comfort.
arXiv Detail & Related papers (2024-04-12T04:45:51Z)
- A Transformer-Based Model for the Prediction of Human Gaze Behavior on Videos [10.149523817328921]
We introduce a novel method for simulating human gaze behavior.
Our approach uses a transformer-based reinforcement learning algorithm to train an agent that acts as a human observer.
arXiv Detail & Related papers (2024-04-10T21:14:33Z)
- What Makes Pre-Trained Visual Representations Successful for Robust Manipulation? [57.92924256181857]
We find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture.
We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models.
arXiv Detail & Related papers (2023-11-03T18:09:08Z)
- Unifying Top-down and Bottom-up Scanpath Prediction Using Transformers [40.27531644565077]
We propose the Human Attention Transformer (HAT), a single model that predicts both forms of attention control.
HAT sets a new standard in computational attention, which emphasizes effectiveness, generality, and interpretability.
arXiv Detail & Related papers (2023-03-16T15:13:09Z)
- What do Vision Transformers Learn? A Visual Exploration [68.50771218442776]
Vision transformers (ViTs) are quickly becoming the de-facto architecture for computer vision.
This paper addresses the obstacles to performing visualizations on ViTs and explores the underlying differences between ViTs and CNNs.
We also conduct large-scale visualizations on a range of ViT variants, including DeiT, CoaT, ConViT, PiT, Swin, and Twin.
arXiv Detail & Related papers (2022-12-13T16:55:12Z)
- Visualizing and Understanding Patch Interactions in Vision Transformer [96.70401478061076]
Vision Transformer (ViT) has become a leading tool in various computer vision tasks.
We propose a novel explainable visualization approach to analyze and interpret the crucial attention interactions among patches for vision transformer.
arXiv Detail & Related papers (2022-03-11T13:48:11Z)
- Intriguing Properties of Vision Transformers [114.28522466830374]
Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems.
We systematically study this question via an extensive set of experiments and comparisons with a high-performing convolutional neural network (CNN).
We show that the effective features of ViTs are due to the flexible and dynamic receptive fields made possible by the self-attention mechanism.
arXiv Detail & Related papers (2021-05-21T17:59:18Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)