Gaze-Informed Vision Transformers: Predicting Driving Decisions Under Uncertainty
- URL: http://arxiv.org/abs/2308.13969v2
- Date: Thu, 09 Jan 2025 20:08:31 GMT
- Title: Gaze-Informed Vision Transformers: Predicting Driving Decisions Under Uncertainty
- Authors: Sharath Koorathota, Nikolas Papadopoulos, Jia Li Ma, Shruti Kumar, Xiaoxiao Sun, Arunesh Mittal, Patrick Adelman, Paul Sajda
- Abstract summary: Vision Transformers (ViT) have advanced computer vision, yet their efficacy in complex tasks like driving remains less explored.
This study enhances ViT by integrating human eye gaze, captured via eye-tracking, to increase prediction accuracy in driving scenarios under uncertainty.
- Score: 5.006068984003071
- Abstract: Vision Transformers (ViT) have advanced computer vision, yet their efficacy in complex tasks like driving remains less explored. This study enhances ViT by integrating human eye gaze, captured via eye-tracking, to increase prediction accuracy in driving decisions under uncertainty, in both real-world and virtual reality settings. First, we establish the significance of human eye gaze in left-right driving decisions, as observed in both human subjects and a ViT model. By comparing the similarity between human fixation maps and ViT attention weights, we reveal the dynamics of overlap across individual heads and layers. This overlap demonstrates that fixation data can guide the model in distributing its attention weights more effectively. We introduce the fixation-attention intersection (FAX) loss, a novel loss function that significantly improves ViT performance under high-uncertainty conditions. Our results show that a ViT trained with FAX loss aligns its attention with human gaze patterns. This gaze-informed approach has significant potential for driver behavior analysis, as well as broader applications in human-centered AI systems, extending ViT's use to complex visual environments.
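The abstract does not spell out the exact formulation of the FAX loss, but the idea of measuring the intersection between ViT attention weights and human fixation maps can be illustrated with a minimal sketch. The snippet below assumes attention taken from the [CLS] token of a chosen layer, a square patch grid, and a histogram-intersection overlap measure; all names, shapes, and the weighting scheme are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def fax_loss(attn_weights: torch.Tensor, fixation_map: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Hypothetical fixation-attention intersection (FAX) loss sketch.

    attn_weights: (batch, heads, num_patches) attention from the [CLS] token
                  to image patches in one ViT layer (assumption).
    fixation_map: (batch, H, W) human fixation density map for each frame.

    Both maps are normalized to probability distributions over patches, and
    the loss penalizes a small overlap (histogram intersection) between them.
    """
    b, h, n = attn_weights.shape
    grid = int(n ** 0.5)  # assume a square patch grid, e.g. 14x14 for ViT-B/16

    # Downsample the fixation map to the patch grid and normalize it.
    fix = F.adaptive_avg_pool2d(fixation_map.unsqueeze(1), (grid, grid))
    fix = fix.flatten(1)
    fix = fix / (fix.sum(dim=-1, keepdim=True) + eps)           # (b, n)

    # Average attention over heads and normalize over patches.
    attn = attn_weights.mean(dim=1)
    attn = attn / (attn.sum(dim=-1, keepdim=True) + eps)        # (b, n)

    # Histogram intersection in [0, 1]; 1 means perfect overlap.
    intersection = torch.minimum(attn, fix).sum(dim=-1)
    return (1.0 - intersection).mean()

# Illustrative usage: add to the task loss for the left-right decision head,
# weighted by a hypothetical coefficient lambda_fax.
# total_loss = F.cross_entropy(logits, labels) + lambda_fax * fax_loss(cls_attn, fixations)
```

The same intersection score, computed per head and per layer without the gradient step, is one way to quantify the overlap between fixation maps and attention weights that the abstract describes.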
Related papers
- Beyond the Doors of Perception: Vision Transformers Represent Relations Between Objects [30.09778169168547]
Vision transformers (ViTs) have achieved state-of-the-art performance in a variety of settings.
However, they exhibit surprising failures when performing tasks involving visual relations.
arXiv Detail & Related papers (2024-06-22T22:43:10Z)
- Realtime Dynamic Gaze Target Tracking and Depth-Level Estimation [6.435984242701043]
The use of Transparent Displays (TDs) in applications such as Heads-Up Displays (HUDs) in vehicles is a burgeoning field, poised to revolutionize user experiences.
This innovation brings significant challenges in realtime human-device interaction, particularly in accurately identifying and tracking a user's gaze on dynamically changing TDs.
We present a robust and efficient two-fold solution for realtime gaze monitoring, comprising: (1) a tree-based algorithm for identifying and dynamically tracking gaze targets; and (2) a multi-stream self-attention architecture to estimate the depth-level of human gaze from eye-tracking data.
arXiv Detail & Related papers (2024-06-09T20:52:47Z)
- Visualizing the loss landscape of Self-supervised Vision Transformer [53.84372035496475]
The masked autoencoder (MAE) has drawn attention as a representative self-supervised approach for masked image modeling with vision transformers.
We visualize the loss landscapes of self-supervised vision transformers trained with both MAE and RC-MAE and compare them with a supervised ViT (Sup-ViT).
To the best of our knowledge, this work is the first to investigate the self-supervised ViT through the lens of the loss landscape.
arXiv Detail & Related papers (2024-05-28T10:54:26Z)
- Simulation of a Vision Correction Display System [0.0]
This paper focuses on simulating a Vision Correction Display (VCD) to enhance the visual experience of individuals with various visual impairments.
With these simulations we can see potential improvements in visual acuity and comfort.
arXiv Detail & Related papers (2024-04-12T04:45:51Z)
- A Transformer-Based Model for the Prediction of Human Gaze Behavior on Videos [10.149523817328921]
We introduce a novel method for simulating human gaze behavior.
Our approach uses a transformer-based reinforcement learning algorithm to train an agent that acts as a human observer.
arXiv Detail & Related papers (2024-04-10T21:14:33Z)
- What Makes Pre-Trained Visual Representations Successful for Robust Manipulation? [57.92924256181857]
We find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture.
We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models.
arXiv Detail & Related papers (2023-11-03T18:09:08Z)
- Unifying Top-down and Bottom-up Scanpath Prediction Using Transformers [40.27531644565077]
We propose the Human Attention Transformer (HAT), a single model that predicts both forms of attention control.
HAT sets a new standard in computational attention modeling, emphasizing effectiveness, generality, and interpretability.
arXiv Detail & Related papers (2023-03-16T15:13:09Z)
- What do Vision Transformers Learn? A Visual Exploration [68.50771218442776]
Vision transformers (ViTs) are quickly becoming the de facto architecture for computer vision.
This paper addresses the obstacles to performing visualizations on ViTs and explores the underlying differences between ViTs and CNNs.
We also conduct large-scale visualizations on a range of ViT variants, including DeiT, CoaT, ConViT, PiT, Swin, and Twin.
arXiv Detail & Related papers (2022-12-13T16:55:12Z)
- Visualizing and Understanding Patch Interactions in Vision Transformer [96.70401478061076]
Vision Transformer (ViT) has become a leading tool in various computer vision tasks.
We propose a novel explainable visualization approach to analyze and interpret the crucial attention interactions among patches in vision transformers.
arXiv Detail & Related papers (2022-03-11T13:48:11Z)
- Intriguing Properties of Vision Transformers [114.28522466830374]
Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems.
We systematically study their properties via an extensive set of experiments and comparisons with a high-performing convolutional neural network (CNN).
We show that the effective features of ViTs are due to the flexible and dynamic receptive fields made possible by the self-attention mechanism.
arXiv Detail & Related papers (2021-05-21T17:59:18Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)