Multi-modality action recognition based on dual feature shift in vehicle
cabin monitoring
- URL: http://arxiv.org/abs/2401.14838v1
- Date: Fri, 26 Jan 2024 13:07:59 GMT
- Title: Multi-modality action recognition based on dual feature shift in vehicle
cabin monitoring
- Authors: Dan Lin, Philip Hann Yung Lee, Yiming Li, Ruoyu Wang, Kim-Hui Yap,
Bingbing Li, and You Shing Ngim
- Abstract summary: We propose a novel yet efficient multi-modality driver action recognition method based on dual feature shift, named DFS.
Experiments have been carried out to verify the effectiveness of the proposed DFS model on the Drive&Act dataset.
- Score: 13.621051517649937
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Driver Action Recognition (DAR) is crucial in vehicle cabin monitoring
systems. In real-world applications, it is common for vehicle cabins to be
equipped with cameras featuring different modalities. However, multi-modality
fusion strategies for the DAR task within car cabins have rarely been studied.
In this paper, we propose a novel yet efficient multi-modality driver action
recognition method based on dual feature shift, named DFS. DFS first integrates
complementary features across modalities by performing modality feature
interaction. Meanwhile, DFS achieves neighbour feature propagation within each
single modality by shifting features among temporal frames. To learn common
patterns and improve model efficiency, DFS shares feature extracting stages
among multiple modalities. Extensive experiments have been carried out to
verify the effectiveness of the proposed DFS model on the Drive&Act dataset.
The results demonstrate that DFS achieves good performance and improves the
efficiency of multi-modality driver action recognition.
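The abstract describes the two shift operations only at a high level. Below is a minimal PyTorch-style sketch of what such a dual feature shift could look like, not the authors' implementation: a TSM-style temporal shift propagates a fraction of channels to neighbouring frames within one modality, a modality shift exchanges a fraction of channels between two modality streams, and a single shared convolution stands in for the shared feature-extracting stages. All tensor shapes, shift ratios, and names (temporal_shift, modality_shift, DualShiftBlock) are illustrative assumptions.

```python
import torch
import torch.nn as nn


def temporal_shift(x: torch.Tensor, ratio: float = 0.125) -> torch.Tensor:
    """Shift a fraction of channels to neighbouring frames (TSM-style).

    x has shape (batch, time, channels, height, width).
    """
    b, t, c, h, w = x.shape
    fold = int(c * ratio)
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                    # propagate forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]    # propagate backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]               # leave the rest untouched
    return out


def modality_shift(x_a: torch.Tensor, x_b: torch.Tensor, ratio: float = 0.125):
    """Exchange a fraction of channels between the two modality streams."""
    fold = int(x_a.shape[2] * ratio)
    a, b = x_a.clone(), x_b.clone()
    a[:, :, :fold] = x_b[:, :, :fold]
    b[:, :, :fold] = x_a[:, :, :fold]
    return a, b


class DualShiftBlock(nn.Module):
    """One shared per-frame stage wrapped with the two shift operations."""

    def __init__(self, channels: int = 64):
        super().__init__()
        # A single set of weights serves both modalities ("shared feature extracting stages").
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def _per_frame(self, x: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = x.shape
        return self.conv(x.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)

    def forward(self, rgb: torch.Tensor, ir: torch.Tensor):
        rgb, ir = modality_shift(rgb, ir)                    # cross-modality feature interaction
        rgb, ir = temporal_shift(rgb), temporal_shift(ir)    # neighbour-frame propagation
        return self._per_frame(rgb), self._per_frame(ir)


if __name__ == "__main__":
    rgb = torch.randn(2, 8, 64, 28, 28)   # hypothetical RGB clip features
    ir = torch.randn(2, 8, 64, 28, 28)    # hypothetical infrared clip features
    out_rgb, out_ir = DualShiftBlock()(rgb, ir)
    print(out_rgb.shape, out_ir.shape)    # torch.Size([2, 8, 64, 28, 28]) twice
```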
Related papers
- Multimodality Helps Few-Shot 3D Point Cloud Semantic Segmentation [61.91492500828508]
Few-shot 3D point cloud segmentation (FS-PCS) aims at generalizing models to segment novel categories with minimal support samples.
We introduce a cost-free multimodal FS-PCS setup, utilizing textual labels and the potentially available 2D image modality.
We propose a simple yet effective Test-time Adaptive Cross-modal Calibration (TACC) technique to mitigate training bias.
arXiv Detail & Related papers (2024-10-29T19:28:41Z)
- DiFSD: Ego-Centric Fully Sparse Paradigm with Uncertainty Denoising and Iterative Refinement for Efficient End-to-End Self-Driving [55.53171248839489]
We propose an ego-centric fully sparse paradigm, named DiFSD, for end-to-end self-driving.
Specifically, DiFSD mainly consists of sparse perception, hierarchical interaction and iterative motion planner.
Experiments conducted on nuScenes and Bench2Drive datasets demonstrate the superior planning performance and great efficiency of DiFSD.
arXiv Detail & Related papers (2024-09-15T15:55:24Z)
- DeepInteraction++: Multi-Modality Interaction for Autonomous Driving [80.8837864849534]
We introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained throughout.
DeepInteraction++ is a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder.
Experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.
arXiv Detail & Related papers (2024-08-09T14:04:21Z)
- MultiFuser: Multimodal Fusion Transformer for Enhanced Driver Action Recognition [10.060717595852271]
We propose a novel multimodal fusion transformer, named MultiFuser.
It identifies cross-modal interrelations and interactions among multimodal car cabin videos.
Extensive experiments are conducted on the Drive&Act dataset.
arXiv Detail & Related papers (2024-08-03T12:33:21Z)
- ASY-VRNet: Waterway Panoptic Driving Perception Model based on Asymmetric Fair Fusion of Vision and 4D mmWave Radar [7.2865477881451755]
Asymmetric Fair Fusion (AFF) modules are designed to efficiently interact with independent features from both visual and radar modalities.
The ASY-VRNet model processes image and radar features based on irregular super-pixel point sets.
Compared to other lightweight models, ASY-VRNet achieves state-of-the-art performance in object detection, semantic segmentation, and drivable-area segmentation.
arXiv Detail & Related papers (2023-08-20T14:53:27Z)
- Robust Multiview Multimodal Driver Monitoring System Using Masked Multi-Head Self-Attention [28.18784311981388]
We propose a novel multiview multimodal driver monitoring system based on feature-level fusion through multi-head self-attention (MHSA).
We demonstrate its effectiveness by comparing it against four alternative fusion strategies (Sum, Conv, SE, and AFF).
Experiments on this enhanced database demonstrate that 1) the proposed MHSA-based fusion method (AUC-ROC: 97.0%) outperforms all baselines and previous approaches, and 2) training MHSA with patch masking can improve its robustness against modality/view collapses.
arXiv Detail & Related papers (2023-04-13T09:50:32Z)
- Visual Exemplar Driven Task-Prompting for Unified Perception in Autonomous Driving [100.3848723827869]
We present an effective multi-task framework, VE-Prompt, which introduces visual exemplars via task-specific prompting.
Specifically, we generate visual exemplars based on bounding boxes and color-based markers, which provide accurate visual appearances of target categories.
We bridge transformer-based encoders and convolutional layers for efficient and accurate unified perception in autonomous driving.
arXiv Detail & Related papers (2023-03-03T08:54:06Z)
- HMS: Hierarchical Modality Selection for Efficient Video Recognition [69.2263841472746]
This paper introduces Hierarchical Modality Selection (HMS), a simple yet efficient multimodal learning framework for efficient video recognition.
HMS operates on a low-cost modality, i.e., audio clues, by default, and dynamically decides on-the-fly whether to use computationally expensive modalities, including appearance and motion clues, on a per-input basis (a generic sketch of this kind of gating appears after this list).
We conduct extensive experiments on two large-scale video benchmarks, FCVID and ActivityNet, and the results demonstrate the proposed approach can effectively explore multimodal information for improved classification performance.
arXiv Detail & Related papers (2021-04-20T04:47:04Z)
- Shared Cross-Modal Trajectory Prediction for Autonomous Driving [24.07872495811019]
We propose a Cross-Modal Embedding framework that aims to benefit from the use of multiple input modalities.
An extensive evaluation is conducted to show the efficacy of the proposed framework using two benchmark driving datasets.
arXiv Detail & Related papers (2020-11-15T07:18:50Z)
- Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for Gesture Recognition [89.0152015268929]
We propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition.
The proposed method includes two key components: 1) enhanced temporal representation via the 3D Central Difference Convolution (3D-CDC) family, and 2) optimized backbones for multi-modal-rate branches and lateral connections.
The resultant multi-rate network provides a new perspective to understand the relationship between RGB and depth modalities and their temporal dynamics.
arXiv Detail & Related papers (2020-08-21T10:45:09Z)
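As referenced in the HMS entry above, the per-input modality selection it describes can be illustrated with a small gating module: a low-cost audio branch always runs, and a gate computed from the audio features decides whether the expensive visual branch is consulted for a given clip. This is a generic sketch under assumed dimensions and module names, not the HMS implementation; an efficient version would actually skip the video branch when the gate is off.

```python
import torch
import torch.nn as nn


class GatedModalitySelector(nn.Module):
    """Toy per-input modality selection: cheap audio always, video only when gated on."""

    def __init__(self, audio_dim=128, video_dim=512, hidden=256, num_classes=10):
        super().__init__()
        self.audio_net = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        self.video_net = nn.Sequential(nn.Linear(video_dim, hidden), nn.ReLU())  # "expensive" branch
        self.gate = nn.Linear(hidden, 1)       # decision made from audio features only
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, audio, video, threshold=0.5):
        a = self.audio_net(audio)                              # low-cost default modality
        use_video = torch.sigmoid(self.gate(a)) > threshold    # per-clip decision, shape (B, 1)
        # The video branch runs unconditionally here for simplicity; an efficient
        # implementation would only execute it for clips where use_video is True.
        v = self.video_net(video)
        fused = torch.where(use_video, a + v, a)
        return self.head(fused)


if __name__ == "__main__":
    logits = GatedModalitySelector()(torch.randn(4, 128), torch.randn(4, 512))
    print(logits.shape)  # torch.Size([4, 10])
```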
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.