Fixating on Attention: Integrating Human Eye Tracking into Vision
Transformers
- URL: http://arxiv.org/abs/2308.13969v1
- Date: Sat, 26 Aug 2023 22:48:06 GMT
- Title: Fixating on Attention: Integrating Human Eye Tracking into Vision
Transformers
- Authors: Sharath Koorathota, Nikolas Papadopoulos, Jia Li Ma, Shruti Kumar,
Xiaoxiao Sun, Arunesh Mittal, Patrick Adelman, Paul Sajda
- Abstract summary: This work demonstrates how human visual input, specifically fixations collected from an eye-tracking device, can be integrated into transformer models to improve accuracy across multiple driving situations and datasets.
We establish the significance of fixation regions in left-right driving decisions, as observed in both human subjects and a Vision Transformer (ViT).
We incorporate information from the driving scene with fixation data, employing a "joint space-fixation" (JSF) attention setup. Lastly, we propose a "fixation-attention intersection" (FAX) loss to train the ViT model to attend to the same regions that humans fixated on.
- Score: 5.221681407166792
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern transformer-based models designed for computer vision have
outperformed humans across a spectrum of visual tasks. However, critical tasks,
such as medical image interpretation or autonomous driving, still require
reliance on human judgments. This work demonstrates how human visual input,
specifically fixations collected from an eye-tracking device, can be integrated
into transformer models to improve accuracy across multiple driving situations
and datasets. First, we establish the significance of fixation regions in
left-right driving decisions, as observed in both human subjects and a Vision
Transformer (ViT). By comparing the similarity between human fixation maps and
ViT attention weights, we reveal the dynamics of overlap across individual
heads and layers. This overlap is exploited for model pruning without
compromising accuracy. Thereafter, we incorporate information from the driving
scene with fixation data, employing a "joint space-fixation" (JSF) attention
setup. Lastly, we propose a "fixation-attention intersection" (FAX) loss to
train the ViT model to attend to the same regions that humans fixated on. We
find that using JSF and FAX improves ViT accuracy and reduces the number of
training epochs required. These results hold significant implications for
human-guided artificial intelligence.
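To make the fixation-attention intersection idea concrete, here is a minimal, hypothetical sketch of a FAX-style auxiliary loss in PyTorch. The tensor shapes, the histogram-intersection overlap measure, the function names, and the lambda weighting are all assumptions for illustration; the paper's exact formulation is not given in the abstract.

```python
# Minimal sketch of a fixation-attention intersection ("FAX"-style) loss.
# Shapes, normalization, and the lambda weighting are illustrative assumptions.
import torch
import torch.nn.functional as F

def fax_style_loss(attn_weights: torch.Tensor, fixation_map: torch.Tensor) -> torch.Tensor:
    """Penalize lack of overlap between ViT attention and human fixations.

    attn_weights: (batch, heads, num_patches) attention from the [CLS] token to
                  each image patch, taken from a chosen layer.
    fixation_map: (batch, num_patches) human fixation density pooled onto the
                  ViT patch grid.
    """
    # Average over heads and renormalize so each map sums to 1 per image.
    attn = attn_weights.mean(dim=1)
    attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    fix = fixation_map / fixation_map.sum(dim=-1, keepdim=True).clamp_min(1e-8)

    # Histogram intersection of the two distributions: element-wise minimum
    # summed over patches. Maximizing it pushes attention toward fixated
    # regions, so the loss is one minus the intersection.
    intersection = torch.minimum(attn, fix).sum(dim=-1)
    return (1.0 - intersection).mean()

def total_loss(logits, labels, attn_weights, fixation_map, lambda_fax=0.5):
    # Hypothetical training objective: left-right driving-decision
    # cross-entropy plus the weighted fixation-overlap term.
    return F.cross_entropy(logits, labels) + lambda_fax * fax_style_loss(attn_weights, fixation_map)
```

Histogram intersection is only one plausible overlap measure here; a KL divergence between the two normalized maps would serve the same role in this sketch, and a JSF-style setup would presumably instead feed fixation-derived tokens into the attention computation itself rather than into the loss.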
Related papers
- GTransPDM: A Graph-embedded Transformer with Positional Decoupling for Pedestrian Crossing Intention Prediction [6.327758022051579]
GTransPDM was developed for pedestrian crossing intention prediction by leveraging multi-modal features.
It achieves 92% accuracy on the PIE dataset and 87% accuracy on the JAAD dataset, with a processing time of 0.05 ms.
arXiv Detail & Related papers (2024-09-30T12:02:17Z)
- Closely Interactive Human Reconstruction with Proxemics and Physics-Guided Adaption [64.07607726562841]
Existing multi-person human reconstruction approaches mainly focus on recovering accurate poses or avoiding penetration.
In this work, we tackle the task of reconstructing closely interactive humans from a monocular video.
We propose to leverage knowledge from proxemic behavior and physics to compensate for the lack of visual information.
arXiv Detail & Related papers (2024-04-17T11:55:45Z)
- EgoNav: Egocentric Scene-aware Human Trajectory Prediction [15.346096596482857]
Wearable collaborative robots can assist wearers who need fall-prevention support or who use exoskeletons.
Such a robot needs to be able to constantly adapt to the surrounding scene based on egocentric vision, and predict the ego motion of the wearer.
In this work, we leveraged body-mounted cameras and sensors to anticipate the trajectory of human wearers through complex surroundings.
arXiv Detail & Related papers (2024-03-27T21:43:12Z)
- Social-Transmotion: Promptable Human Trajectory Prediction [65.80068316170613]
Social-Transmotion is a generic Transformer-based model that exploits diverse and numerous visual cues to predict human behavior.
Our approach is validated on multiple datasets, including JTA, JRDB, Pedestrians and Cyclists in Road Traffic, and ETH-UCY.
arXiv Detail & Related papers (2023-12-26T18:56:49Z)
- Multimodal Vision Transformers with Forced Attention for Behavior Analysis [0.0]
We introduce the Forced Attention (FAt) Transformer, which utilizes forced attention with a modified backbone for input encoding and makes use of additional inputs.
FAt Transformers are applied to two downstream tasks: personality recognition and body language recognition.
We achieve state-of-the-art results on the Udiva v0.5, First Impressions v2, and MPII Group Interaction datasets.
arXiv Detail & Related papers (2022-12-07T21:56:50Z)
- TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild [77.59069361196404]
TRiPOD is a novel method for predicting body dynamics based on graph attentional networks.
To incorporate a real-world challenge, we learn an indicator representing whether an estimated body joint is visible/invisible at each frame.
Our evaluation shows that TRiPOD outperforms all prior work and state-of-the-art specifically designed for each of the trajectory and pose forecasting tasks.
arXiv Detail & Related papers (2021-04-08T20:01:00Z)
- Where is my hand? Deep hand segmentation for visual self-recognition in humanoid robots [129.46920552019247]
We propose the use of a Convolutional Neural Network (CNN) to segment the robot hand from an image in an egocentric view.
We fine-tuned the Mask-RCNN network for the specific task of segmenting the hand of the humanoid robot Vizzy.
arXiv Detail & Related papers (2021-02-09T10:34:32Z)
- Fine-Grained Vehicle Perception via 3D Part-Guided Visual Data Augmentation [77.60050239225086]
We propose an effective training data generation process by fitting a 3D car model with dynamic parts to vehicles in real images.
Our approach is fully automatic without any human interaction.
We present a multi-task network for VUS parsing and a multi-stream network for VHI parsing.
arXiv Detail & Related papers (2020-12-15T03:03:38Z)
- Learning Accurate and Human-Like Driving using Semantic Maps and Attention [152.48143666881418]
This paper investigates how end-to-end driving models can be improved to drive more accurately and in a more human-like manner.
We exploit semantic and visual maps from HERE Technologies and use them to augment the existing Drive360 dataset.
Our models are trained and evaluated on the Drive360 + HERE dataset, which features 60 hours and 3000 km of real-world driving data.
arXiv Detail & Related papers (2020-07-10T22:25:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.