Related papers: Fixating on Attention: Integrating Human Eye Tracking into Vision Transformers

Fixating on Attention: Integrating Human Eye Tracking into Vision Transformers

URL: http://arxiv.org/abs/2308.13969v1
Date: Sat, 26 Aug 2023 22:48:06 GMT
Title: Fixating on Attention: Integrating Human Eye Tracking into Vision Transformers
Authors: Sharath Koorathota, Nikolas Papadopoulos, Jia Li Ma, Shruti Kumar, Xiaoxiao Sun, Arunesh Mittal, Patrick Adelman, Paul Sajda
Abstract summary: This work demonstrates how human visual input, specifically fixations collected from an eye-tracking device, can be integrated into transformer models to improve accuracy across multiple driving situations and datasets. We establish the significance of fixation regions in left-right driving decisions, as observed in both human subjects and a Vision Transformer (ViT) We incorporate information from the driving scene with fixation data, employing a "joint space-fixation" (JSF) attention setup. Lastly, we propose a "fixation-attention intersection" (FAX) loss to train the ViT model to attend to the same regions that humans fixated on
Score: 5.221681407166792
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Modern transformer-based models designed for computer vision have outperformed humans across a spectrum of visual tasks. However, critical tasks, such as medical image interpretation or autonomous driving, still require reliance on human judgments. This work demonstrates how human visual input, specifically fixations collected from an eye-tracking device, can be integrated into transformer models to improve accuracy across multiple driving situations and datasets. First, we establish the significance of fixation regions in left-right driving decisions, as observed in both human subjects and a Vision Transformer (ViT). By comparing the similarity between human fixation maps and ViT attention weights, we reveal the dynamics of overlap across individual heads and layers. This overlap is exploited for model pruning without compromising accuracy. Thereafter, we incorporate information from the driving scene with fixation data, employing a "joint space-fixation" (JSF) attention setup. Lastly, we propose a "fixation-attention intersection" (FAX) loss to train the ViT model to attend to the same regions that humans fixated on. We find that the ViT performance is improved in accuracy and number of training epochs when using JSF and FAX. These results hold significant implications for human-guided artificial intelligence.

Related papers

GTransPDM: A Graph-embedded Transformer with Positional Decoupling for Pedestrian Crossing Intention Prediction [6.327758022051579]
GTransPDM was developed for pedestrian crossing intention prediction by leveraging multi-modal features. It achieves 92% accuracy on the PIE dataset and 87% accuracy on the JAAD dataset, with a processing speed of 0.05ms.
arXiv Detail & Related papers (2024-09-30T12:02:17Z)
Beyond the Doors of Perception: Vision Transformers Represent Relations Between Objects [30.09778169168547]
Vision transformers (ViTs) have achieved state-of-the-art performance in a variety of settings. However, they exhibit surprising failures when performing tasks involving visual relations.
arXiv Detail & Related papers (2024-06-22T22:43:10Z)
Realtime Dynamic Gaze Target Tracking and Depth-Level Estimation [6.435984242701043]
Transparent Displays (TD) in various applications, such as Heads-Up Displays (HUDs) in vehicles, is a burgeoning field, poised to revolutionize user experiences. This innovation brings forth significant challenges in realtime human-device interaction, particularly in accurately identifying and tracking a user's gaze on dynamically changing TDs. We present a two-fold robust and efficient systematic solution for realtime gaze monitoring, comprised of: (1) a tree-based algorithm for identifying and dynamically tracking gaze targets; and (2) a multi-stream self-attention architecture to estimate the depth-level of human gaze from eye tracking data.
arXiv Detail & Related papers (2024-06-09T20:52:47Z)
Visualizing the loss landscape of Self-supervised Vision Transformer [53.84372035496475]
The Masked autoencoder (MAE) has drawn attention as a representative self-supervised approach for masked image modeling with vision transformers. We visualize the loss landscapes of the self-supervised vision transformer by both MAE and RC-MAE and compare them with the supervised ViT (Sup-ViT) To the best of our knowledge, this work is the first to investigate the self-supervised ViT through the lens of the loss landscape.
arXiv Detail & Related papers (2024-05-28T10:54:26Z)
Closely Interactive Human Reconstruction with Proxemics and Physics-Guided Adaption [64.07607726562841]
Existing multi-person human reconstruction approaches mainly focus on recovering accurate poses or avoiding penetration. In this work, we tackle the task of reconstructing closely interactive humans from a monocular video. We propose to leverage knowledge from proxemic behavior and physics to compensate the lack of visual information.
arXiv Detail & Related papers (2024-04-17T11:55:45Z)
Simulation of a Vision Correction Display System [0.0]
This paper focuses on simulating a Vision Correction Display (VCD) to enhance the visual experience of individuals with various visual impairments. With these simulations we can see potential improvements in visual acuity and comfort.
arXiv Detail & Related papers (2024-04-12T04:45:51Z)
A Transformer-Based Model for the Prediction of Human Gaze Behavior on Videos [10.149523817328921]
We introduce a novel method for simulating human gaze behavior. Our approach uses a transformer-based reinforcement learning algorithm to train an agent that acts as a human observer.
arXiv Detail & Related papers (2024-04-10T21:14:33Z)
EgoNav: Egocentric Scene-aware Human Trajectory Prediction [15.346096596482857]
Wearable collaborative robots stand to assist human wearers who need fall prevention assistance or wear exoskeletons. Such a robot needs to be able to constantly adapt to the surrounding scene based on egocentric vision, and predict the ego motion of the wearer. In this work, we leveraged body-mounted cameras and sensors to anticipate the trajectory of human wearers through complex surroundings.
arXiv Detail & Related papers (2024-03-27T21:43:12Z)
Social-Transmotion: Promptable Human Trajectory Prediction [65.80068316170613]
Social-Transmotion is a generic Transformer-based model that exploits diverse and numerous visual cues to predict human behavior. Our approach is validated on multiple datasets, including JTA, JRDB, Pedestrians and Cyclists in Road Traffic, and ETH-UCY.
arXiv Detail & Related papers (2023-12-26T18:56:49Z)
What Makes Pre-Trained Visual Representations Successful for Robust Manipulation? [57.92924256181857]
We find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture. We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models.
arXiv Detail & Related papers (2023-11-03T18:09:08Z)
Unifying Top-down and Bottom-up Scanpath Prediction Using Transformers [40.27531644565077]
We propose the Human Attention Transformer (HAT), a single model that predicts both forms of attention control. HAT sets a new standard in computational attention, which emphasizes effectiveness, generality, and interpretability.
arXiv Detail & Related papers (2023-03-16T15:13:09Z)
What do Vision Transformers Learn? A Visual Exploration [68.50771218442776]
Vision transformers (ViTs) are quickly becoming the de-facto architecture for computer vision. This paper addresses the obstacles to performing visualizations on ViTs and explores the underlying differences between ViTs and CNNs. We also conduct large-scale visualizations on a range of ViT variants, including DeiT, CoaT, ConViT, PiT, Swin, and Twin.
arXiv Detail & Related papers (2022-12-13T16:55:12Z)
Multimodal Vision Transformers with Forced Attention for Behavior Analysis [0.0]
We introduce the Forced Attention (FAt) Transformer which utilize forced attention with a modified backbone for input encoding and a use of additional inputs. FAt Transformers are applied to two downstream tasks: personality recognition and body language recognition. We achieve state-of-the-art results for Udiva v0.5, First Impressions v2 and MPII Group Interaction datasets.
arXiv Detail & Related papers (2022-12-07T21:56:50Z)
Visualizing and Understanding Patch Interactions in Vision Transformer [96.70401478061076]
Vision Transformer (ViT) has become a leading tool in various computer vision tasks. We propose a novel explainable visualization approach to analyze and interpret the crucial attention interactions among patches for vision transformer.
arXiv Detail & Related papers (2022-03-11T13:48:11Z)
Intriguing Properties of Vision Transformers [114.28522466830374]
Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems. We systematically study this question via an extensive set of experiments and comparisons with a high-performing convolutional neural network (CNN) We show effective features of ViTs are due to flexible receptive and dynamic fields possible via the self-attention mechanism.
arXiv Detail & Related papers (2021-05-21T17:59:18Z)
Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples. We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild [77.59069361196404]
TRiPOD is a novel method for predicting body dynamics based on graph attentional networks. To incorporate a real-world challenge, we learn an indicator representing whether an estimated body joint is visible/invisible at each frame. Our evaluation shows that TRiPOD outperforms all prior work and state-of-the-art specifically designed for each of the trajectory and pose forecasting tasks.
arXiv Detail & Related papers (2021-04-08T20:01:00Z)
Where is my hand? Deep hand segmentation for visual self-recognition in humanoid robots [129.46920552019247]
We propose the use of a Convolution Neural Network (CNN) to segment the robot hand from an image in an egocentric view. We fine-tuned the Mask-RCNN network for the specific task of segmenting the hand of the humanoid robot Vizzy.
arXiv Detail & Related papers (2021-02-09T10:34:32Z)
Fine-Grained Vehicle Perception via 3D Part-Guided Visual Data Augmentation [77.60050239225086]
We propose an effective training data generation process by fitting a 3D car model with dynamic parts to vehicles in real images. Our approach is fully automatic without any human interaction. We present a multi-task network for VUS parsing and a multi-stream network for VHI parsing.
arXiv Detail & Related papers (2020-12-15T03:03:38Z)
Learning Accurate and Human-Like Driving using Semantic Maps and Attention [152.48143666881418]
This paper investigates how end-to-end driving models can be improved to drive more accurately and human-like. We exploit semantic and visual maps from HERE Technologies and augment the existing Drive360 dataset with such. Our models are trained and evaluated on the Drive360 + HERE dataset, which features 60 hours and 3000 km of real-world driving data.
arXiv Detail & Related papers (2020-07-10T22:25:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.