M$^2$DAR: Multi-View Multi-Scale Driver Action Recognition with Vision Transformer
- URL: http://arxiv.org/abs/2305.08877v1
- Date: Sat, 13 May 2023 02:38:15 GMT
- Title: M$^2$DAR: Multi-View Multi-Scale Driver Action Recognition with Vision Transformer
- Authors: Yunsheng Ma, Liangqi Yuan, Amr Abdelraouf, Kyungtae Han, Rohit Gupta, Zihao Li, Ziran Wang
- Abstract summary: We present a multi-view, multi-scale framework for naturalistic driving action recognition and localization in untrimmed videos.
Our system features a weight-sharing, multi-scale Transformer-based action recognition network that learns robust hierarchical representations.
- Score: 5.082919518353888
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Ensuring traffic safety and preventing accidents are critical goals in daily
driving, and advances in computer vision technologies can be leveraged toward
them. In this paper, we present a multi-view, multi-scale
framework for naturalistic driving action recognition and localization in
untrimmed videos, namely M$^2$DAR, with a particular focus on detecting
distracted driving behaviors. Our system features a weight-sharing, multi-scale
Transformer-based action recognition network that learns robust hierarchical
representations. Furthermore, we propose a new election algorithm consisting of
aggregation, filtering, merging, and selection processes to refine the
preliminary results from the action recognition module across multiple views.
Extensive experiments conducted on the 7th AI City Challenge Track 3 dataset
demonstrate the effectiveness of our approach, where we achieved an overlap
score of 0.5921 on the A2 test set. Our source code is available at
https://github.com/PurdueDigitalTwin/M2DAR.
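The listing carries only the abstract, but the phrase "weight-sharing, multi-scale Transformer-based action recognition network" suggests a single backbone applied to the same clip at several spatial scales, with the per-scale outputs fused. Below is a minimal PyTorch sketch of that reading; the class name `MultiScaleRecognizer`, the scale set, and the logit averaging are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleRecognizer(nn.Module):
    """Sketch: one shared Transformer backbone applied at several spatial
    scales of a video clip, with per-scale logits averaged. All names and
    hyperparameters here are assumptions, not the M2DAR specification."""

    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int,
                 scales=(224, 192, 160)):
        super().__init__()
        self.backbone = backbone  # the same weights are reused at every scale
        self.scales = scales
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (B, C, T, H, W)
        logits = []
        for s in self.scales:
            # Resize frames to s x s while keeping the temporal length T;
            # weight sharing means no extra parameters per scale.
            x = F.interpolate(clip, size=(clip.shape[2], s, s),
                              mode="trilinear", align_corners=False)
            feats = self.backbone(x)  # assumed to return (B, feat_dim)
            logits.append(self.head(feats))
        return torch.stack(logits).mean(dim=0)  # (B, num_classes)
```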
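The abstract names an election algorithm with four stages, aggregation, filtering, merging, and selection, that refines preliminary per-view predictions. The sketch below shows one plausible reading of those stages applied to sliding-window candidates; the function name, thresholds, and data layout are hypothetical and may differ from the released code at https://github.com/PurdueDigitalTwin/M2DAR.

```python
from collections import defaultdict

def election(view_results, prob_thresh=0.5, merge_gap=1.0):
    """Hypothetical reconstruction of the four-stage election process.

    view_results: list of (view_id, class_id, start_s, end_s, prob) tuples
                  produced by the recognition network on sliding windows.
    Returns at most one (start, end) segment per class.
    """
    # 1. Aggregation: pool candidate segments from all camera views per class.
    by_class = defaultdict(list)
    for _, cls, start, end, prob in view_results:
        by_class[cls].append((start, end, prob))

    refined = {}
    for cls, segs in by_class.items():
        # 2. Filtering: drop low-confidence candidates.
        segs = [s for s in segs if s[2] >= prob_thresh]
        if not segs:
            continue
        # 3. Merging: fuse temporally overlapping or nearby segments.
        segs.sort(key=lambda s: s[0])
        merged = [list(segs[0])]
        for start, end, prob in segs[1:]:
            if start - merged[-1][1] <= merge_gap:
                merged[-1][1] = max(merged[-1][1], end)
                merged[-1][2] = max(merged[-1][2], prob)
            else:
                merged.append([start, end, prob])
        # 4. Selection: keep the highest-confidence merged segment.
        best = max(merged, key=lambda s: s[2])
        refined[cls] = (best[0], best[1])
    return refined

# Example: candidates for class 3 from two views merge across a 0.3 s gap.
cands = [(0, 3, 10.0, 14.5, 0.81), (1, 3, 14.8, 18.0, 0.77)]
print(election(cands))  # {3: (10.0, 18.0)}
```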
Related papers
- Multi-view Action Recognition via Directed Gromov-Wasserstein Discrepancy [12.257725479880458]
Action recognition has become one of the most popular research topics in computer vision.
We propose a multi-view attention consistency method that computes the similarity between two attentions from two different views of the action videos.
Our approach applies the idea of Neural Radiance Fields to implicitly render features from novel views when training on single-view datasets.
arXiv Detail & Related papers (2024-05-02T14:43:21Z)
- Lifting Multi-View Detection and Tracking to the Bird's Eye View [5.679775668038154]
Recent advancements in multi-view detection and 3D object recognition have significantly improved performance.
We compare modern lifting methods, both parameter-free and parameterized, to multi-view aggregation.
We present an architecture that aggregates the features of multiple time steps to learn robust detection.
arXiv Detail & Related papers (2024-03-19T09:33:07Z)
- Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z)
- Egocentric RGB+Depth Action Recognition in Industry-Like Settings [50.38638300332429]
Our work focuses on recognizing actions from egocentric RGB and Depth modalities in an industry-like environment.
Our framework is based on the 3D Video SWIN Transformer to encode both RGB and Depth modalities effectively.
Our method also secured first place at the multimodal action recognition challenge at ICIAP 2023.
arXiv Detail & Related papers (2023-09-25T08:56:22Z)
- AIDE: A Vision-Driven Multi-View, Multi-Modal, Multi-Tasking Dataset for Assistive Driving Perception [26.84439405241999]
We present an AssIstive Driving pErception dataset (AIDE) that considers context information both inside and outside the vehicle.
AIDE facilitates holistic driver monitoring through three distinctive characteristics.
Two fusion strategies are introduced to give new insights into learning effective multi-stream/modal representations.
arXiv Detail & Related papers (2023-07-26T03:12:05Z)
- LiDAR-BEVMTN: Real-Time LiDAR Bird's-Eye View Multi-Task Perception Network for Autonomous Driving [12.713417063678335]
We present a real-time multi-task convolutional neural network for LiDAR-based object detection, semantic segmentation, and motion segmentation.
We propose a novel Semantic Weighting and Guidance (SWAG) module to selectively transfer semantic features for improved object detection.
We achieve state-of-the-art results for two tasks, semantic and motion segmentation, and close to state-of-the-art performance for 3D object detection.
arXiv Detail & Related papers (2023-07-17T21:22:17Z)
- A novel efficient Multi-view traffic-related object detection framework [17.50049841016045]
We propose a novel traffic-related framework named CEVAS to achieve efficient object detection using multi-view video data.
Results show that our framework significantly reduces response latency while achieving the same detection accuracy as the state-of-the-art methods.
arXiv Detail & Related papers (2023-02-23T06:42:37Z)
- A Simple Baseline for Multi-Camera 3D Object Detection [94.63944826540491]
3D object detection with surrounding cameras has been a promising direction for autonomous driving.
We present SimMOD, a Simple baseline for Multi-camera Object Detection.
We conduct extensive experiments on the 3D object detection benchmark of nuScenes to demonstrate the effectiveness of SimMOD.
arXiv Detail & Related papers (2022-08-22T03:38:01Z)
- Multi-Modal Fusion Transformer for End-to-End Autonomous Driving [59.60483620730437]
We propose TransFuser, a novel Multi-Modal Fusion Transformer, to integrate image and LiDAR representations using attention.
Our approach achieves state-of-the-art driving performance while reducing collisions by 76% compared to geometry-based fusion.
arXiv Detail & Related papers (2021-04-19T11:48:13Z)
- Fine-Grained Vehicle Perception via 3D Part-Guided Visual Data Augmentation [77.60050239225086]
We propose an effective training data generation process by fitting a 3D car model with dynamic parts to vehicles in real images.
Our approach is fully automatic without any human interaction.
We present a multi-task network for VUS parsing and a multi-stream network for VHI parsing.
arXiv Detail & Related papers (2020-12-15T03:03:38Z)
- Anchor-free Small-scale Multispectral Pedestrian Detection [88.7497134369344]
We propose a method for effective and efficient multispectral fusion of the two modalities in an adapted single-stage anchor-free base architecture.
We aim at learning pedestrian representations based on object center and scale rather than direct bounding box predictions.
Results show our method's effectiveness in detecting small-scale pedestrians.
arXiv Detail & Related papers (2020-08-19T13:13:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.