MultiFuser: Multimodal Fusion Transformer for Enhanced Driver Action Recognition
- URL: http://arxiv.org/abs/2408.01766v2
- Date: Sat, 17 Aug 2024 09:32:36 GMT
- Title: MultiFuser: Multimodal Fusion Transformer for Enhanced Driver Action Recognition
- Authors: Ruoyu Wang, Wenqian Wang, Jianjun Gao, Dan Lin, Kim-Hui Yap, Bingbing Li
- Abstract summary: We propose a novel multimodal fusion transformer, named MultiFuser.
It identifies cross-modal interrelations and interactions among multimodal car cabin videos.
Extensive experiments are conducted on the Drive&Act dataset.
- Score: 10.060717595852271
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Driver action recognition, which aims to accurately identify drivers' behaviors, is crucial for enhancing driver-vehicle interactions and ensuring driving safety. Unlike general action recognition, driver action recognition must cope with challenging environments that are often gloomy and dark. With the development of sensors, various cameras such as IR and depth cameras have emerged for analyzing drivers' behaviors. Therefore, in this paper, we propose a novel multimodal fusion transformer, named MultiFuser, which identifies cross-modal interrelations and interactions among multimodal car cabin videos and adaptively integrates different modalities for improved representations. Specifically, MultiFuser comprises layers of Bi-decomposed Modules to model spatiotemporal features, with a modality synthesizer for multimodal feature integration. Each Bi-decomposed Module includes a Modal Expertise ViT block for extracting modality-specific features and a Patch-wise Adaptive Fusion block for efficient cross-modal fusion. Extensive experiments are conducted on the Drive&Act dataset, and the results demonstrate the efficacy of our proposed approach.
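The abstract does not spell out how the Patch-wise Adaptive Fusion block combines modalities, so the following is only an illustrative sketch of one common way to fuse features patch by patch, not the authors' implementation. It assumes each modality (for example RGB, IR, depth) yields the same number of patch vectors with the same dimension, and that a hypothetical scoring vector `gate` rates each patch; a softmax over modalities then gives per-patch mixing weights. The names `patchwise_fuse`, `gate`, and the modality keys are assumptions made for this example.

```python
import math
from typing import Dict, List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def patchwise_fuse(features: Dict[str, List[Vector]], gate: Vector) -> List[Vector]:
    """Fuse per-modality patch features with per-patch softmax weights.

    features: modality name -> list of patch vectors; every modality is
              assumed to have the same patch count and vector dimension.
    gate:     hypothetical scoring vector (an assumption of this sketch);
              the score of a patch is its dot product with `gate`.
    """
    modalities = list(features)
    num_patches = len(features[modalities[0]])
    fused: List[Vector] = []
    for p in range(num_patches):
        # Collect the p-th patch from every modality.
        patches = [features[m][p] for m in modalities]
        # Score each modality's patch, then normalise across modalities.
        scores = [sum(g * x for g, x in zip(gate, patch)) for patch in patches]
        weights = softmax(scores)
        # Weighted sum of the modality patches, one dimension at a time.
        dim = len(patches[0])
        fused.append(
            [sum(w * patch[d] for w, patch in zip(weights, patches)) for d in range(dim)]
        )
    return fused


if __name__ == "__main__":
    toy = {"rgb": [[1.0, 0.0]], "ir": [[0.0, 1.0]]}
    # A zero gate scores every modality equally, so the patch is averaged.
    print(patchwise_fuse(toy, gate=[0.0, 0.0]))
```

With a zero `gate` the modalities are averaged; a `gate` that favors one modality's features shifts the per-patch weight toward that modality, which is the "adaptive" behavior this sketch is meant to illustrate. In a trained model the scoring would be learned rather than fixed.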
Related papers
- Hints of Prompt: Enhancing Visual Representation for Multimodal LLMs in Autonomous Driving [65.04643267731122]
General MLLMs combined with CLIP often struggle to represent driving-specific scenarios accurately.
We propose the Hints of Prompt (HoP) framework, which introduces three key enhancements.
These hints are fused through a Hint Fusion module, enriching visual representations and enhancing multimodal reasoning.
arXiv Detail & Related papers (2024-11-20T06:58:33Z)
- Condition-Aware Multimodal Fusion for Robust Semantic Perception of Driving Scenes [56.52618054240197]
We propose a novel, condition-aware multimodal fusion approach for robust semantic perception of driving scenes.
Our method, CAFuser, uses an RGB camera input to classify environmental conditions and generate a Condition Token that guides the fusion of multiple sensor modalities.
We set the new state of the art with CAFuser on the MUSES dataset with 59.7 PQ for multimodal panoptic segmentation and 78.2 mIoU for semantic segmentation, ranking first on the public benchmarks.
arXiv Detail & Related papers (2024-10-14T17:56:20Z)
- DeepInteraction++: Multi-Modality Interaction for Autonomous Driving [80.8837864849534]
We introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained throughout.
DeepInteraction++ is a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder.
Experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.
arXiv Detail & Related papers (2024-08-09T14:04:21Z)
- DMM: Disparity-guided Multispectral Mamba for Oriented Object Detection in Remote Sensing [8.530409994516619]
Multispectral oriented object detection faces challenges due to both inter-modal and intra-modal discrepancies.
We propose Disparity-guided Multispectral Mamba (DMM), a framework comprised of a Disparity-guided Cross-modal Fusion Mamba (DCFM) module, a Multi-scale Target-aware Attention (MTA) module, and a Target-Prior Aware (TPA) auxiliary task.
arXiv Detail & Related papers (2024-07-11T02:09:59Z)
- M2DA: Multi-Modal Fusion Transformer Incorporating Driver Attention for Autonomous Driving [11.36165122994834]
We propose a Multi-Modal fusion transformer incorporating Driver Attention (M2DA) for autonomous driving.
By incorporating driver attention, we endow autonomous vehicles with human-like scene understanding, so that they can identify crucial areas precisely and ensure safety.
arXiv Detail & Related papers (2024-03-19T08:54:52Z)
- Multi-modality action recognition based on dual feature shift in vehicle cabin monitoring [13.621051517649937]
We propose a novel yet efficient multi-modality driver action recognition method based on dual feature shift, named DFS.
Experiments have been carried out to verify the effectiveness of the proposed DFS model on the Drive&Act dataset.
arXiv Detail & Related papers (2024-01-26T13:07:59Z)
- Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z)
- Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos [58.93586436289648]
We propose a multi-scale cooperative multimodal transformer (MCMulT) architecture for multimodal sentiment analysis.
Our model outperforms existing approaches on unaligned multimodal sequences and has strong performance on aligned multimodal sequences.
arXiv Detail & Related papers (2022-06-16T07:47:57Z)
- Multi-Modal Fusion Transformer for End-to-End Autonomous Driving [59.60483620730437]
We propose TransFuser, a novel Multi-Modal Fusion Transformer, to integrate image and LiDAR representations using attention.
Our approach achieves state-of-the-art driving performance while reducing collisions by 76% compared to geometry-based fusion.
arXiv Detail & Related papers (2021-04-19T11:48:13Z)
- Low Rank Fusion based Transformers for Multimodal Sequences [9.507869508188266]
We present two methods for multimodal sentiment and emotion recognition, with results on the CMU-MOSEI, CMU-MOSI, and IEMOCAP datasets.
We show that our models have fewer parameters, train faster, and perform comparably to many larger fusion-based architectures.
arXiv Detail & Related papers (2020-07-04T08:05:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.