Multi-modality action recognition based on dual feature shift in vehicle cabin monitoring
- URL: http://arxiv.org/abs/2401.14838v1
- Date: Fri, 26 Jan 2024 13:07:59 GMT
- Title: Multi-modality action recognition based on dual feature shift in vehicle cabin monitoring
- Authors: Dan Lin, Philip Hann Yung Lee, Yiming Li, Ruoyu Wang, Kim-Hui Yap,
Bingbing Li, and You Shing Ngim
- Abstract summary: We propose a novel yet efficient multi-modality driver action recognition method based on dual feature shift, named DFS.
Experiments have been carried out to verify the effectiveness of the proposed DFS model on the Drive&Act dataset.
- Score: 13.621051517649937
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Driver Action Recognition (DAR) is crucial in vehicle cabin monitoring
systems. In real-world applications, it is common for vehicle cabins to be
equipped with cameras featuring different modalities. However, multi-modality
fusion strategies for the DAR task within car cabins have rarely been studied.
In this paper, we propose a novel yet efficient multi-modality driver action
recognition method based on dual feature shift, named DFS. DFS first integrates
complementary features across modalities by performing modality feature
interaction. Meanwhile, DFS achieves the neighbour feature propagation within
single modalities, by feature shifting among temporal frames. To learn common
patterns and improve model efficiency, DFS shares feature extracting stages
among multiple modalities. Extensive experiments have been carried out to
verify the effectiveness of the proposed DFS model on the Drive&Act dataset.
The results demonstrate that DFS achieves good performance and improves the
efficiency of multi-modality driver action recognition.
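The abstract describes the temporal feature-shift idea (propagating neighbour features among frames within a single modality) only at a high level. The following minimal NumPy sketch illustrates one common realization of channel shifting along the temporal axis; the function name, the (T, C) tensor layout, and the 0.25 shift ratio are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def temporal_feature_shift(x, shift_ratio=0.25):
    """Shift a fraction of feature channels one frame forward and
    one frame backward along the temporal axis, so every frame
    mixes in features from its neighbours.

    x: (T, C) array -- T frames, C feature channels per frame.
    The layout and ratio are illustrative assumptions.
    """
    T, C = x.shape
    n = int(C * shift_ratio)            # channels shifted each way
    out = np.zeros_like(x)              # zero-pad at the boundaries
    out[1:, :n] = x[:-1, :n]            # shift forward in time
    out[:-1, n:2 * n] = x[1:, n:2 * n]  # shift backward in time
    out[:, 2 * n:] = x[:, 2 * n:]       # remaining channels unchanged
    return out

x = np.arange(32, dtype=float).reshape(4, 8)  # 4 frames, 8 channels
y = temporal_feature_shift(x)
```

Because the shift is plain slicing with no extra parameters, this kind of neighbour propagation adds essentially zero computational cost, which is consistent with the efficiency claim in the abstract.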
Related papers
- MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild [81.32127423981426]
Multimodal emotion recognition based on audio and video data is important for real-world applications.
Recent methods have focused on exploiting advances of self-supervised learning (SSL) for pre-training of strong multimodal encoders.
We propose a different perspective on the problem and investigate the advancement of multimodal DFER performance by adapting SSL-pre-trained disjoint unimodal encoders.
arXiv Detail & Related papers (2024-04-13T13:39:26Z)
- Exploring Driving Behavior for Autonomous Vehicles Based on Gramian Angular Field Vision Transformer [13.020654798874475]
This paper presents the Gramian Angular Field Vision Transformer (GAF-ViT) model, designed to analyze driving behavior.
The proposed GAF-ViT model consists of three key components: the Transformer Module, the Channel Attention Module, and the Multi-Channel ViT Module.
arXiv Detail & Related papers (2023-10-21T04:24:30Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- ASY-VRNet: Waterway Panoptic Driving Perception Model based on Asymmetric Fair Fusion of Vision and 4D mmWave Radar [7.2865477881451755]
Asymmetric Fair Fusion (AFF) modules are designed to efficiently interact with independent features from both the visual and radar modalities.
The ASY-VRNet model processes image and radar features based on irregular super-pixel point sets.
Compared to other lightweight models, ASY-VRNet achieves state-of-the-art performance in object detection, semantic segmentation, and drivable-area segmentation.
arXiv Detail & Related papers (2023-08-20T14:53:27Z)
- Robust Multiview Multimodal Driver Monitoring System Using Masked Multi-Head Self-Attention [28.18784311981388]
We propose a novel multiview multimodal driver monitoring system based on feature-level fusion through multi-head self-attention (MHSA).
We demonstrate its effectiveness by comparing it against four alternative fusion strategies (Sum, Conv, SE, and AFF).
Experiments on this enhanced database demonstrate that 1) the proposed MHSA-based fusion method (AUC-ROC: 97.0%) outperforms all baselines and previous approaches, and 2) training MHSA with patch masking can improve its robustness against modality/view collapses.
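As an illustration of feature-level fusion through multi-head self-attention over per-view or per-modality feature tokens, here is a minimal NumPy sketch. The identity Q/K/V projections, the mean pooling, and the shapes are simplifying assumptions for brevity, not the paper's trained architecture.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mhsa_fuse(tokens, num_heads=2):
    """Fuse feature tokens (one per camera view or modality) with a
    single layer of multi-head self-attention, then mean-pool them
    into one fused vector.

    tokens: (N, D) -- N views/modalities, D-dim features each.
    Identity projections stand in for learned Q/K/V weights.
    """
    N, D = tokens.shape
    d = D // num_heads
    heads = []
    for h in range(num_heads):
        q = k = v = tokens[:, h * d:(h + 1) * d]       # (N, d) slice per head
        attn = softmax(q @ k.T / np.sqrt(d), axis=-1)  # (N, N) attention map
        heads.append(attn @ v)                         # (N, d) mixed features
    mixed = np.concatenate(heads, axis=-1)             # (N, D)
    return mixed.mean(axis=0)                          # fused (D,) vector

rng = np.random.default_rng(0)
fused = mhsa_fuse(rng.normal(size=(3, 8)))  # 3 views, 8-dim features
```

Because every token attends to every other, each view's representation is reweighted by its agreement with the other views before pooling, which is the intuition behind attention-based fusion outperforming simple summation.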
arXiv Detail & Related papers (2023-04-13T09:50:32Z)
- Visual Exemplar Driven Task-Prompting for Unified Perception in Autonomous Driving [100.3848723827869]
We present an effective multi-task framework, VE-Prompt, which introduces visual exemplars via task-specific prompting.
Specifically, we generate visual exemplars based on bounding boxes and color-based markers, which provide accurate visual appearances of target categories.
We bridge transformer-based encoders and convolutional layers for efficient and accurate unified perception in autonomous driving.
arXiv Detail & Related papers (2023-03-03T08:54:06Z)
- Value Function is All You Need: A Unified Learning Framework for Ride Hailing Platforms [57.21078336887961]
Large ride-hailing platforms, such as DiDi, Uber and Lyft, connect tens of thousands of vehicles in a city to millions of ride demands throughout the day.
We propose a unified value-based dynamic learning framework (V1D3) for tackling both tasks.
arXiv Detail & Related papers (2021-05-18T19:22:24Z)
- HMS: Hierarchical Modality Selection for Efficient Video Recognition [69.2263841472746]
This paper introduces Hierarchical Modality Selection (HMS), a simple yet efficient multimodal learning framework for efficient video recognition.
HMS operates on a low-cost modality, i.e., audio clues, by default, and dynamically decides on-the-fly whether to use computationally expensive modalities, including appearance and motion clues, on a per-input basis.
We conduct extensive experiments on two large-scale video benchmarks, FCVID and ActivityNet, and the results demonstrate the proposed approach can effectively explore multimodal information for improved classification performance.
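The on-the-fly selection that HMS describes (a low-cost modality by default, with expensive modalities invoked per input) can be sketched as a simple confidence gate. The max-softmax confidence measure, the 0.8 threshold, and the function names below are illustrative assumptions, not the paper's actual selection policy.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def hierarchical_predict(cheap_logits, expensive_fn, conf_threshold=0.8):
    """Run the cheap (e.g. audio) classifier first; fall back to the
    expensive (e.g. appearance/motion) pathway only when the cheap
    prediction is not confident enough.

    Returns (predicted_label, used_expensive_path).
    """
    probs = softmax(cheap_logits)
    if probs.max() >= conf_threshold:
        return int(probs.argmax()), False   # cheap path sufficed
    return expensive_fn(), True             # expensive path invoked

# A confident cheap prediction skips the expensive branch entirely.
label, used_expensive = hierarchical_predict(
    np.array([5.0, 0.0, 0.0]), expensive_fn=lambda: 2)
```

The average compute cost then scales with how often inputs are ambiguous, rather than with the cost of the most expensive modality.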
arXiv Detail & Related papers (2021-04-20T04:47:04Z)
- Shared Cross-Modal Trajectory Prediction for Autonomous Driving [24.07872495811019]
We propose a Cross-Modal Embedding framework that aims to benefit from the use of multiple input modalities.
An extensive evaluation is conducted to show the efficacy of the proposed framework using two benchmark driving datasets.
arXiv Detail & Related papers (2020-11-15T07:18:50Z)
- Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for Gesture Recognition [89.0152015268929]
We propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition.
The proposed method includes two key components: 1) enhanced temporal representation via the 3D Central Difference Convolution (3D-CDC) family, and 2) optimized backbones for multi-modal-rate branches and lateral connections.
The resultant multi-rate network provides a new perspective to understand the relationship between RGB and depth modalities and their temporal dynamics.
arXiv Detail & Related papers (2020-08-21T10:45:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.