Multi-modality action recognition based on dual feature shift in vehicle cabin monitoring
- URL: http://arxiv.org/abs/2401.14838v1
- Date: Fri, 26 Jan 2024 13:07:59 GMT
- Title: Multi-modality action recognition based on dual feature shift in vehicle cabin monitoring
- Authors: Dan Lin, Philip Hann Yung Lee, Yiming Li, Ruoyu Wang, Kim-Hui Yap,
Bingbing Li, and You Shing Ngim
- Abstract summary: We propose a novel yet efficient multi-modality driver action recognition method based on dual feature shift, named DFS.
Experiments have been carried out to verify the effectiveness of the proposed DFS model on the Drive&Act dataset.
- Score: 13.621051517649937
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Driver Action Recognition (DAR) is crucial in vehicle cabin monitoring
systems. In real-world applications, it is common for vehicle cabins to be
equipped with cameras featuring different modalities. However, multi-modality
fusion strategies for the DAR task within car cabins have rarely been studied.
In this paper, we propose a novel yet efficient multi-modality driver action
recognition method based on dual feature shift, named DFS. DFS first integrates
complementary features across modalities by performing modality feature
interaction. Meanwhile, DFS achieves the neighbour feature propagation within
single modalities, by feature shifting among temporal frames. To learn common
patterns and improve model efficiency, DFS shares feature extracting stages
among multiple modalities. Extensive experiments have been carried out to
verify the effectiveness of the proposed DFS model on the Drive&Act dataset.
The results demonstrate that DFS achieves good performance and improves the
efficiency of multi-modality driver action recognition.
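The abstract describes the temporal feature-shift idea (propagating neighbour features among frames within a single modality) only at a high level. The following minimal NumPy sketch illustrates one common realization of channel shifting along the temporal axis; the function name, the (T, C) tensor layout, and the 0.25 shift ratio are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def temporal_feature_shift(x, shift_ratio=0.25):
    """Shift a fraction of feature channels one frame forward and
    one frame backward along the temporal axis, so every frame
    mixes in features from its neighbours.

    x: (T, C) array -- T frames, C feature channels per frame.
    The layout and ratio are illustrative assumptions.
    """
    T, C = x.shape
    n = int(C * shift_ratio)            # channels shifted each way
    out = np.zeros_like(x)              # zero-pad at the boundaries
    out[1:, :n] = x[:-1, :n]            # shift forward in time
    out[:-1, n:2 * n] = x[1:, n:2 * n]  # shift backward in time
    out[:, 2 * n:] = x[:, 2 * n:]       # remaining channels unchanged
    return out

x = np.arange(32, dtype=float).reshape(4, 8)  # 4 frames, 8 channels
y = temporal_feature_shift(x)
```

Because the shift is plain slicing with no extra parameters, this kind of neighbour propagation adds essentially zero computational cost, which is consistent with the efficiency claim in the abstract.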
Related papers
- MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild [81.32127423981426]
Multimodal emotion recognition based on audio and video data is important for real-world applications.
Recent methods have focused on exploiting advances of self-supervised learning (SSL) for pre-training of strong multimodal encoders.
We propose a different perspective on the problem and investigate the advancement of multimodal DFER performance by adapting SSL-pre-trained disjoint unimodal encoders.
arXiv Detail & Related papers (2024-04-13T13:39:26Z)
- Exploring Driving Behavior for Autonomous Vehicles Based on Gramian Angular Field Vision Transformer [13.020654798874475]
This paper presents the Gramian Angular Field Vision Transformer (GAF-ViT) model, designed to analyze driving behavior.
The proposed GAF-ViT model consists of three key components: the Transformer Module, the Channel Attention Module, and the Multi-Channel ViT Module.
arXiv Detail & Related papers (2023-10-21T04:24:30Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- ASY-VRNet: Waterway Panoptic Driving Perception Model based on Asymmetric Fair Fusion of Vision and 4D mmWave Radar [7.2865477881451755]
Asymmetric Fair Fusion (AFF) modules are designed to efficiently interact with independent features from both the visual and radar modalities.
The ASY-VRNet model processes image and radar features based on irregular super-pixel point sets.
Compared to other lightweight models, ASY-VRNet achieves state-of-the-art performance in object detection, semantic segmentation, and drivable-area segmentation.
arXiv Detail & Related papers (2023-08-20T14:53:27Z)
- Robust Multiview Multimodal Driver Monitoring System Using Masked Multi-Head Self-Attention [28.18784311981388]
We propose a novel multiview multimodal driver monitoring system based on feature-level fusion through multi-head self-attention (MHSA).
We demonstrate its effectiveness by comparing it against four alternative fusion strategies (Sum, Conv, SE, and AFF).
Experiments on this enhanced database demonstrate that 1) the proposed MHSA-based fusion method (AUC-ROC: 97.0%) outperforms all baselines and previous approaches, and 2) training MHSA with patch masking can improve its robustness against modality/view collapses.
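As an illustration of feature-level fusion through multi-head self-attention over per-view or per-modality feature tokens, here is a minimal NumPy sketch. The identity Q/K/V projections, the mean pooling, and the shapes are simplifying assumptions for brevity, not the paper's trained architecture.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mhsa_fuse(tokens, num_heads=2):
    """Fuse feature tokens (one per camera view or modality) with a
    single layer of multi-head self-attention, then mean-pool them
    into one fused vector.

    tokens: (N, D) -- N views/modalities, D-dim features each.
    Identity projections stand in for learned Q/K/V weights.
    """
    N, D = tokens.shape
    d = D // num_heads
    heads = []
    for h in range(num_heads):
        q = k = v = tokens[:, h * d:(h + 1) * d]       # (N, d) slice per head
        attn = softmax(q @ k.T / np.sqrt(d), axis=-1)  # (N, N) attention map
        heads.append(attn @ v)                         # (N, d) mixed features
    mixed = np.concatenate(heads, axis=-1)             # (N, D)
    return mixed.mean(axis=0)                          # fused (D,) vector

rng = np.random.default_rng(0)
fused = mhsa_fuse(rng.normal(size=(3, 8)))  # 3 views, 8-dim features
```

Because every token attends to every other, each view's representation is reweighted by its agreement with the other views before pooling, which is the intuition behind attention-based fusion outperforming simple summation.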
arXiv Detail & Related papers (2023-04-13T09:50:32Z)
- Visual Exemplar Driven Task-Prompting for Unified Perception in Autonomous Driving [100.3848723827869]
We present an effective multi-task framework, VE-Prompt, which introduces visual exemplars via task-specific prompting.
Specifically, we generate visual exemplars based on bounding boxes and color-based markers, which provide accurate visual appearances of target categories.
We bridge transformer-based encoders and convolutional layers for efficient and accurate unified perception in autonomous driving.
arXiv Detail & Related papers (2023-03-03T08:54:06Z)
- Value Function is All You Need: A Unified Learning Framework for Ride Hailing Platforms [57.21078336887961]
Large ride-hailing platforms, such as DiDi, Uber and Lyft, connect tens of thousands of vehicles in a city to millions of ride demands throughout the day.
We propose a unified value-based dynamic learning framework (V1D3) for tackling both tasks.
arXiv Detail & Related papers (2021-05-18T19:22:24Z)
- HMS: Hierarchical Modality Selection for Efficient Video Recognition [69.2263841472746]
This paper introduces Hierarchical Modality Selection (HMS), a simple yet efficient multimodal learning framework for efficient video recognition.
HMS operates on a low-cost modality, i.e., audio clues, by default, and dynamically decides on-the-fly whether to use computationally expensive modalities, including appearance and motion clues, on a per-input basis.
We conduct extensive experiments on two large-scale video benchmarks, FCVID and ActivityNet, and the results demonstrate the proposed approach can effectively explore multimodal information for improved classification performance.
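The on-the-fly selection that HMS describes (a low-cost modality by default, with expensive modalities invoked per input) can be sketched as a simple confidence gate. The max-softmax confidence measure, the 0.8 threshold, and the function names below are illustrative assumptions, not the paper's actual selection policy.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def hierarchical_predict(cheap_logits, expensive_fn, conf_threshold=0.8):
    """Run the cheap (e.g. audio) classifier first; fall back to the
    expensive (e.g. appearance/motion) pathway only when the cheap
    prediction is not confident enough.

    Returns (predicted_label, used_expensive_path).
    """
    probs = softmax(cheap_logits)
    if probs.max() >= conf_threshold:
        return int(probs.argmax()), False   # cheap path sufficed
    return expensive_fn(), True             # expensive path invoked

# A confident cheap prediction skips the expensive branch entirely.
label, used_expensive = hierarchical_predict(
    np.array([5.0, 0.0, 0.0]), expensive_fn=lambda: 2)
```

The average compute cost then scales with how often inputs are ambiguous, rather than with the cost of the most expensive modality.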
arXiv Detail & Related papers (2021-04-20T04:47:04Z)
- Shared Cross-Modal Trajectory Prediction for Autonomous Driving [24.07872495811019]
We propose a Cross-Modal Embedding framework that aims to benefit from the use of multiple input modalities.
An extensive evaluation is conducted to show the efficacy of the proposed framework using two benchmark driving datasets.
arXiv Detail & Related papers (2020-11-15T07:18:50Z)
- Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for Gesture Recognition [89.0152015268929]
We propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition.
The proposed method includes two key components: 1) enhanced temporal representation via the 3D Central Difference Convolution (3D-CDC) family, and 2) optimized backbones for multi-modal-rate branches and lateral connections.
The resultant multi-rate network provides a new perspective to understand the relationship between RGB and depth modalities and their temporal dynamics.
arXiv Detail & Related papers (2020-08-21T10:45:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.