M$^2$DAR: Multi-View Multi-Scale Driver Action Recognition with Vision Transformer
- URL: http://arxiv.org/abs/2305.08877v1
- Date: Sat, 13 May 2023 02:38:15 GMT
- Title: M$^2$DAR: Multi-View Multi-Scale Driver Action Recognition with Vision Transformer
- Authors: Yunsheng Ma, Liangqi Yuan, Amr Abdelraouf, Kyungtae Han, Rohit Gupta, Zihao Li, Ziran Wang
- Abstract summary: We present a multi-view, multi-scale framework for naturalistic driving action recognition and localization in untrimmed videos.
Our system features a weight-sharing, multi-scale Transformer-based action recognition network that learns robust hierarchical representations.
- Score: 5.082919518353888
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Ensuring traffic safety and preventing accidents are critical goals in daily
driving, and advances in computer vision technologies can be leveraged toward
them. In this paper, we present a multi-view, multi-scale
framework for naturalistic driving action recognition and localization in
untrimmed videos, namely M$^2$DAR, with a particular focus on detecting
distracted driving behaviors. Our system features a weight-sharing, multi-scale
Transformer-based action recognition network that learns robust hierarchical
representations. Furthermore, we propose a new election algorithm consisting of
aggregation, filtering, merging, and selection processes to refine the
preliminary results from the action recognition module across multiple views.
Extensive experiments conducted on the 7th AI City Challenge Track 3 dataset
demonstrate the effectiveness of our approach, where we achieved an overlap
score of 0.5921 on the A2 test set. Our source code is available at
https://github.com/PurdueDigitalTwin/M2DAR.
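The listing carries only the abstract, but the phrase "weight-sharing, multi-scale Transformer-based action recognition network" suggests a single backbone applied to the same clip at several spatial scales, with the per-scale outputs fused. Below is a minimal PyTorch sketch of that reading; the class name `MultiScaleRecognizer`, the scale set, and the logit averaging are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleRecognizer(nn.Module):
    """Sketch: one shared Transformer backbone applied at several spatial
    scales of a video clip, with per-scale logits averaged. All names and
    hyperparameters here are assumptions, not the M2DAR specification."""

    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int,
                 scales=(224, 192, 160)):
        super().__init__()
        self.backbone = backbone  # the same weights are reused at every scale
        self.scales = scales
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (B, C, T, H, W)
        logits = []
        for s in self.scales:
            # Resize frames to s x s while keeping the temporal length T;
            # weight sharing means no extra parameters per scale.
            x = F.interpolate(clip, size=(clip.shape[2], s, s),
                              mode="trilinear", align_corners=False)
            feats = self.backbone(x)  # assumed to return (B, feat_dim)
            logits.append(self.head(feats))
        return torch.stack(logits).mean(dim=0)  # (B, num_classes)
```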
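The abstract names an election algorithm with four stages, aggregation, filtering, merging, and selection, that refines preliminary per-view predictions. The sketch below shows one plausible reading of those stages applied to sliding-window candidates; the function name, thresholds, and data layout are hypothetical and may differ from the released code at https://github.com/PurdueDigitalTwin/M2DAR.

```python
from collections import defaultdict

def election(view_results, prob_thresh=0.5, merge_gap=1.0):
    """Hypothetical reconstruction of the four-stage election process.

    view_results: list of (view_id, class_id, start_s, end_s, prob) tuples
                  produced by the recognition network on sliding windows.
    Returns at most one (start, end) segment per class.
    """
    # 1. Aggregation: pool candidate segments from all camera views per class.
    by_class = defaultdict(list)
    for _, cls, start, end, prob in view_results:
        by_class[cls].append((start, end, prob))

    refined = {}
    for cls, segs in by_class.items():
        # 2. Filtering: drop low-confidence candidates.
        segs = [s for s in segs if s[2] >= prob_thresh]
        if not segs:
            continue
        # 3. Merging: fuse temporally overlapping or nearby segments.
        segs.sort(key=lambda s: s[0])
        merged = [list(segs[0])]
        for start, end, prob in segs[1:]:
            if start - merged[-1][1] <= merge_gap:
                merged[-1][1] = max(merged[-1][1], end)
                merged[-1][2] = max(merged[-1][2], prob)
            else:
                merged.append([start, end, prob])
        # 4. Selection: keep the highest-confidence merged segment.
        best = max(merged, key=lambda s: s[2])
        refined[cls] = (best[0], best[1])
    return refined

# Example: candidates for class 3 from two views merge across a 0.3 s gap.
cands = [(0, 3, 10.0, 14.5, 0.81), (1, 3, 14.8, 18.0, 0.77)]
print(election(cands))  # {3: (10.0, 18.0)}
```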
Related papers
- Multi-view Action Recognition via Directed Gromov-Wasserstein Discrepancy [12.257725479880458]
Action recognition has become one of the most popular research topics in computer vision.
We propose a multi-view attention consistency method that computes the similarity between two attentions from two different views of the action videos.
Our approach applies the idea of Neural Radiance Fields to implicitly render features from novel views when training on single-view datasets.
arXiv Detail & Related papers (2024-05-02T14:43:21Z)
- Lifting Multi-View Detection and Tracking to the Bird's Eye View [5.679775668038154]
Recent advancements in multi-view detection and 3D object recognition have significantly improved performance.
We compare modern lifting methods, both parameter-free and parameterized, to multi-view aggregation.
We present an architecture that aggregates the features of multiple time steps to learn robust detection.
arXiv Detail & Related papers (2024-03-19T09:33:07Z)
- Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z)
- Egocentric RGB+Depth Action Recognition in Industry-Like Settings [50.38638300332429]
Our work focuses on recognizing actions from egocentric RGB and Depth modalities in an industry-like environment.
Our framework is based on the 3D Video SWIN Transformer to encode both RGB and Depth modalities effectively.
Our method also secured first place at the multimodal action recognition challenge at ICIAP 2023.
arXiv Detail & Related papers (2023-09-25T08:56:22Z)
- AIDE: A Vision-Driven Multi-View, Multi-Modal, Multi-Tasking Dataset for Assistive Driving Perception [26.84439405241999]
We present an AssIstive Driving pErception dataset (AIDE) that considers context information both inside and outside the vehicle.
AIDE facilitates holistic driver monitoring through three distinctive characteristics.
Two fusion strategies are introduced to give new insights into learning effective multi-stream/modal representations.
arXiv Detail & Related papers (2023-07-26T03:12:05Z)
- LiDAR-BEVMTN: Real-Time LiDAR Bird's-Eye View Multi-Task Perception Network for Autonomous Driving [12.713417063678335]
We present a real-time multi-task convolutional neural network for LiDAR-based object detection, semantic segmentation, and motion segmentation.
We propose a novel Semantic Weighting and Guidance (SWAG) module to selectively transfer semantic features for improved object detection.
We achieve state-of-the-art results for two tasks, semantic and motion segmentation, and close to state-of-the-art performance for 3D object detection.
arXiv Detail & Related papers (2023-07-17T21:22:17Z)
- A novel efficient Multi-view traffic-related object detection framework [17.50049841016045]
We propose a novel traffic-related framework named CEVAS to achieve efficient object detection using multi-view video data.
Results show that our framework significantly reduces response latency while achieving the same detection accuracy as the state-of-the-art methods.
arXiv Detail & Related papers (2023-02-23T06:42:37Z)
- A Simple Baseline for Multi-Camera 3D Object Detection [94.63944826540491]
3D object detection with surrounding cameras has been a promising direction for autonomous driving.
We present SimMOD, a Simple baseline for Multi-camera Object Detection.
We conduct extensive experiments on the 3D object detection benchmark of nuScenes to demonstrate the effectiveness of SimMOD.
arXiv Detail & Related papers (2022-08-22T03:38:01Z)
- Multi-Modal Fusion Transformer for End-to-End Autonomous Driving [59.60483620730437]
We propose TransFuser, a novel Multi-Modal Fusion Transformer, to integrate image and LiDAR representations using attention.
Our approach achieves state-of-the-art driving performance while reducing collisions by 76% compared to geometry-based fusion.
arXiv Detail & Related papers (2021-04-19T11:48:13Z)
- Fine-Grained Vehicle Perception via 3D Part-Guided Visual Data Augmentation [77.60050239225086]
We propose an effective training data generation process by fitting a 3D car model with dynamic parts to vehicles in real images.
Our approach is fully automatic without any human interaction.
We present a multi-task network for VUS parsing and a multi-stream network for VHI parsing.
arXiv Detail & Related papers (2020-12-15T03:03:38Z)
- Anchor-free Small-scale Multispectral Pedestrian Detection [88.7497134369344]
We propose a method for effective and efficient multispectral fusion of the two modalities in an adapted single-stage anchor-free base architecture.
We aim at learning pedestrian representations based on object center and scale rather than direct bounding box predictions.
Results show our method's effectiveness in detecting small-scale pedestrians.
arXiv Detail & Related papers (2020-08-19T13:13:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.