Robust Multiview Multimodal Driver Monitoring System Using Masked
Multi-Head Self-Attention
- URL: http://arxiv.org/abs/2304.06370v1
- Date: Thu, 13 Apr 2023 09:50:32 GMT
- Title: Robust Multiview Multimodal Driver Monitoring System Using Masked
Multi-Head Self-Attention
- Authors: Yiming Ma, Victor Sanchez, Soodeh Nikan, Devesh Upadhyay, Bhushan
Atote, Tanaya Guha
- Abstract summary: We propose a novel multiview multimodal driver monitoring system based on feature-level fusion through multi-head self-attention (MHSA).
We demonstrate its effectiveness by comparing it against four alternative fusion strategies (Sum, Conv, SE, and AFF).
Experiments on this enhanced database demonstrate that 1) the proposed MHSA-based fusion method (AUC-ROC: 97.0%) outperforms all baselines and previous approaches, and 2) training MHSA with patch masking can improve its robustness against modality/view collapses.
- Score: 28.18784311981388
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Driver Monitoring Systems (DMSs) are crucial for safe hand-over actions in
Level-2+ self-driving vehicles. State-of-the-art DMSs leverage multiple sensors
mounted at different locations to monitor the driver and the vehicle's interior
scene and employ decision-level fusion to integrate these heterogeneous data.
However, this fusion method may not fully utilize the complementarity of
different data sources and may overlook their relative importance. To address
these limitations, we propose a novel multiview multimodal driver monitoring
system based on feature-level fusion through multi-head self-attention (MHSA).
We demonstrate its effectiveness by comparing it against four alternative
fusion strategies (Sum, Conv, SE, and AFF). We also present a novel
GPU-friendly supervised contrastive learning framework SuMoCo to learn better
representations. Furthermore, we fine-grained the test split of the DAD dataset
to enable the multi-class recognition of drivers' activities. Experiments on
this enhanced database demonstrate that 1) the proposed MHSA-based fusion
method (AUC-ROC: 97.0\%) outperforms all baselines and previous approaches, and
2) training MHSA with patch masking can improve its robustness against
modality/view collapses. The code and annotations are publicly available.
Related papers
- Graph-Based Multi-Modal Sensor Fusion for Autonomous Driving [3.770103075126785]
We introduce a novel approach to multi-modal sensor fusion, focusing on developing a graph-based state representation.
We present a Sensor-Agnostic Graph-Aware Kalman Filter, the first online state estimation technique designed to fuse multi-modal graphs.
We validate the effectiveness of our proposed framework through extensive experiments conducted on both synthetic and real-world driving datasets.
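As a point of reference only, the sketch below shows a generic Kalman-filter measurement update that sequentially fuses observations from two sensors; it illustrates Kalman-based multi-sensor fusion in general and is not the paper's Sensor-Agnostic Graph-Aware Kalman Filter. All matrices and measurement values are toy assumptions.

```python
# Generic Kalman-filter measurement update fusing two sensor observations.
# Toy illustration only; not the cited paper's graph-aware formulation.
import numpy as np

def kf_update(x, P, z, H, R):
    """One measurement update: state x, covariance P, observation z."""
    S = H @ P @ H.T + R                     # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)          # Kalman gain
    x = x + K @ (z - H @ x)                 # corrected state
    P = (np.eye(len(x)) - K @ H) @ P        # corrected covariance
    return x, P

# 2D position state, fused sequentially with a camera and a radar measurement.
x, P = np.zeros(2), np.eye(2)
H = np.eye(2)
x, P = kf_update(x, P, np.array([1.0, 0.9]), H, 0.5 * np.eye(2))   # camera
x, P = kf_update(x, P, np.array([1.1, 1.0]), H, 0.2 * np.eye(2))   # radar
```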
arXiv Detail & Related papers (2024-11-06T06:58:17Z)
- Multimodality Helps Few-Shot 3D Point Cloud Semantic Segmentation [61.91492500828508]
Few-shot 3D point cloud segmentation (FS-PCS) aims at generalizing models to segment novel categories with minimal support samples.
We introduce a cost-free multimodal FS-PCS setup, utilizing textual labels and the potentially available 2D image modality.
We propose a simple yet effective Test-time Adaptive Cross-modal Seg (TACC) technique to mitigate training bias.
arXiv Detail & Related papers (2024-10-29T19:28:41Z)
- MultiFuser: Multimodal Fusion Transformer for Enhanced Driver Action Recognition [10.060717595852271]
We propose a novel multimodal fusion transformer, named MultiFuser.
It identifies cross-modal interrelations and interactions among multimodal car cabin videos.
Extensive experiments are conducted on the Drive&Act dataset.
arXiv Detail & Related papers (2024-08-03T12:33:21Z)
- M2DA: Multi-Modal Fusion Transformer Incorporating Driver Attention for Autonomous Driving [11.36165122994834]
We propose a Multi-Modal fusion transformer incorporating Driver Attention (M2DA) for autonomous driving.
By incorporating driver attention, we endow autonomous vehicles with human-like scene understanding, enabling them to identify crucial areas precisely and ensure safety.
arXiv Detail & Related papers (2024-03-19T08:54:52Z)
- Joint Multimodal Transformer for Emotion Recognition in the Wild [49.735299182004404]
Multimodal emotion recognition (MMER) systems typically outperform unimodal systems.
This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention.
arXiv Detail & Related papers (2024-03-15T17:23:38Z)
- G-MEMP: Gaze-Enhanced Multimodal Ego-Motion Prediction in Driving [71.9040410238973]
We focus on inferring the ego trajectory of a driver's vehicle using their gaze data.
Next, we develop G-MEMP, a novel multimodal ego-trajectory prediction network that combines GPS and video input with gaze data.
The results show that G-MEMP significantly outperforms state-of-the-art methods in both benchmarks.
arXiv Detail & Related papers (2023-12-13T23:06:30Z)
- Drive Anywhere: Generalizable End-to-end Autonomous Driving with Multi-modal Foundation Models [114.69732301904419]
We present an end-to-end open-set (any environment/scene) autonomous driving approach capable of providing driving decisions from representations queryable by image and text.
Our approach demonstrates unparalleled results in diverse tests while achieving significantly greater robustness in out-of-distribution situations.
arXiv Detail & Related papers (2023-10-26T17:56:35Z)
- AIDE: A Vision-Driven Multi-View, Multi-Modal, Multi-Tasking Dataset for Assistive Driving Perception [26.84439405241999]
We present an AssIstive Driving pErception dataset (AIDE) that considers context information both inside and outside the vehicle.
AIDE facilitates holistic driver monitoring through three distinctive characteristics.
Two fusion strategies are introduced to give new insights into learning effective multi-stream/modal representations.
arXiv Detail & Related papers (2023-07-26T03:12:05Z)
- Towards Robust On-Ramp Merging via Augmented Multimodal Reinforcement Learning [9.48157144651867]
We present a novel approach for robust on-ramp merging of CAVs via augmented and multi-modal reinforcement learning.
Specifically, we formulate the on-ramp merging problem as a Markov decision process (MDP) by taking driving safety, comfortable driving behavior, and traffic efficiency into account.
To provide reliable merging maneuvers, we simultaneously leverage BSM and surveillance images for multi-modal observation.
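To make the MDP formulation above concrete, here is a toy reward that combines safety, driving comfort, and traffic efficiency terms; the specific terms, thresholds, and weights are illustrative assumptions rather than the paper's actual reward design.

```python
# Toy on-ramp merging reward combining safety, comfort, and efficiency terms.
# All thresholds and weights are illustrative assumptions.
def merge_reward(gap_m, jerk, speed_mps, target_speed_mps=25.0,
                 w_safety=1.0, w_comfort=0.1, w_efficiency=0.5):
    safety = -1.0 if gap_m < 5.0 else 0.0                          # penalize unsafe gaps
    comfort = -abs(jerk)                                           # penalize harsh jerk
    efficiency = -abs(speed_mps - target_speed_mps) / target_speed_mps
    return w_safety * safety + w_comfort * comfort + w_efficiency * efficiency

print(merge_reward(gap_m=8.0, jerk=0.5, speed_mps=22.0))
```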
arXiv Detail & Related papers (2022-07-21T16:34:57Z)
- Multimodal Object Detection via Bayesian Fusion [59.31437166291557]
We study multimodal object detection with RGB and thermal cameras, since the latter can provide much stronger object signatures under poor illumination.
Our key contribution is a non-learned late-fusion method that fuses together bounding box detections from different modalities.
We apply our approach to benchmarks containing both aligned (KAIST) and unaligned (FLIR) multimodal sensor data.
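For illustration, the sketch below shows a generic non-learned late fusion of two detections: boxes from different modalities are matched by IoU and their class scores combined with a naive-Bayes-style rule assuming conditional independence. This is a simplified stand-in, not necessarily the exact fusion rule used in the cited paper.

```python
# Generic late fusion of detections from two modalities (e.g. RGB and thermal).
# The score-combination rule below is an assumed naive-Bayes-style illustration.
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def fuse_scores(p_rgb, p_thermal, prior=0.5):
    # Combine per-modality class probabilities assuming conditional independence.
    odds = (p_rgb / (1 - p_rgb)) * (p_thermal / (1 - p_thermal)) / (prior / (1 - prior))
    return odds / (1 + odds)

# Two overlapping detections of the same object from different sensors.
box_rgb, box_th = (10, 10, 50, 90), (12, 8, 52, 88)
if iou(box_rgb, box_th) > 0.5:
    print(fuse_scores(0.7, 0.8))   # fused confidence exceeds either input
```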
arXiv Detail & Related papers (2021-04-07T04:03:20Z)
- DMD: A Large-Scale Multi-Modal Driver Monitoring Dataset for Attention and Alertness Analysis [54.198237164152786]
Vision is the richest and most cost-effective technology for Driver Monitoring Systems (DMS).
The lack of sufficiently large and comprehensive datasets is currently a bottleneck for the progress of DMS development.
In this paper, we introduce the Driver Monitoring dataset (DMD), an extensive dataset which includes real and simulated driving scenarios.
arXiv Detail & Related papers (2020-08-27T12:33:54Z)