Transformer-based Fusion of 2D-pose and Spatio-temporal Embeddings for
Distracted Driver Action Recognition
- URL: http://arxiv.org/abs/2403.06577v1
- Date: Mon, 11 Mar 2024 10:26:38 GMT
- Title: Transformer-based Fusion of 2D-pose and Spatio-temporal Embeddings for
Distracted Driver Action Recognition
- Authors: Erkut Akdag, Zeqi Zhu, Egor Bondarev, Peter H. N. De With
- Abstract summary: Temporal localization of driving actions over time is important for advanced driver-assistance systems and naturalistic driving studies.
We aim to improve temporal localization and classification accuracy by adapting video action recognition and 2D human-pose estimation networks into one model.
The model performs well on the A2 test set of the 2023 NVIDIA AI City Challenge for naturalistic driving action recognition.
- Score: 8.841708075914353
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Classification and localization of driving actions over time are important for
advanced driver-assistance systems and naturalistic driving studies. Temporal
localization is challenging because it requires robustness, reliability, and
accuracy. In this study, we aim to improve temporal localization and
classification accuracy by adapting video action recognition and 2D
human-pose estimation networks into one model. To this end, we design a
transformer-based fusion architecture to effectively combine 2D-pose features
and spatio-temporal features. The model uses 2D-pose features as the positional
embedding of the transformer architecture and spatio-temporal features as the
main input to the encoder of the transformer. The proposed solution is generic
and independent of the number and positions of cameras, giving frame-based class
probabilities as output. Finally, the post-processing step combines information
from different camera views to obtain final predictions and eliminate false
positives. The model performs well on the A2 test set of the 2023 NVIDIA AI
City Challenge for naturalistic driving action recognition, achieving an
overlap score of 0.5079 on the organizer-defined distracted driver behaviour
metric.
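
The fusion described in the abstract can be illustrated with a minimal PyTorch-style sketch. The module names, tensor shapes, and hyperparameters below are illustrative assumptions, not the authors' implementation: pre-extracted spatio-temporal features enter a standard transformer encoder as the main input, projected 2D-pose features are added as the positional embedding, and a linear head outputs frame-based class probabilities.

```python
# Minimal sketch of the described fusion (assumed shapes and sizes, not the authors' code).
import torch
import torch.nn as nn

class PoseSpatioTemporalFusion(nn.Module):
    def __init__(self, st_dim=1024, pose_dim=34, d_model=512, num_classes=16,
                 n_heads=8, n_layers=4):
        super().__init__()
        # Project spatio-temporal features (e.g. from a video backbone) to the model width.
        self.st_proj = nn.Linear(st_dim, d_model)
        # Project 2D-pose features (e.g. flattened keypoint coordinates) to the same width,
        # so they can serve as the positional embedding of the encoder input.
        self.pose_proj = nn.Linear(pose_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, st_feats, pose_feats):
        # st_feats:   (batch, frames, st_dim)   spatio-temporal embeddings per frame
        # pose_feats: (batch, frames, pose_dim) 2D-pose features per frame
        tokens = self.st_proj(st_feats) + self.pose_proj(pose_feats)  # pose acts as positional embedding
        encoded = self.encoder(tokens)
        return self.head(encoded).softmax(dim=-1)  # frame-based class probabilities

# Example usage with dummy tensors.
model = PoseSpatioTemporalFusion()
probs = model(torch.randn(2, 64, 1024), torch.randn(2, 64, 34))
print(probs.shape)  # (2, 64, 16)
```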
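
The abstract also mentions a post-processing step that merges per-camera outputs and suppresses false positives. A simple hedged interpretation is sketched below: frame probabilities are averaged across camera views, and only confident, sufficiently long segments are kept. The function name, confidence threshold, and minimum segment length are invented for illustration and are not taken from the paper.

```python
# Hedged illustration of multi-view post-processing: average per-camera frame
# probabilities, then keep only confident, sufficiently long action segments.
# Thresholds and minimum lengths are illustrative, not from the paper.
import numpy as np

def merge_views(per_camera_probs, conf_thresh=0.6, min_len=15):
    """per_camera_probs: list of arrays with shape (frames, num_classes)."""
    probs = np.mean(np.stack(per_camera_probs), axis=0)    # fuse camera views
    labels = probs.argmax(axis=1)
    confident = probs.max(axis=1) >= conf_thresh            # drop low-confidence frames

    segments, start = [], None
    for t, ok in enumerate(confident):
        if ok and start is None:
            start = t
        elif (not ok or labels[t] != labels[t - 1]) and start is not None:
            if t - start >= min_len:                         # discard short false positives
                segments.append((start, t, int(labels[start])))
            start = t if ok else None
    if start is not None and len(confident) - start >= min_len:
        segments.append((start, len(confident), int(labels[start])))
    return segments  # list of (start_frame, end_frame, class_id)
```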
Related papers
- GTransPDM: A Graph-embedded Transformer with Positional Decoupling for Pedestrian Crossing Intention Prediction [6.327758022051579]
GTransPDM was developed for pedestrian crossing intention prediction by leveraging multi-modal features.
It achieves 92% accuracy on the PIE dataset and 87% accuracy on the JAAD dataset, with a processing time of 0.05 ms.
arXiv Detail & Related papers (2024-09-30T12:02:17Z) - Event-Aided Time-to-Collision Estimation for Autonomous Driving [28.13397992839372]
We present a novel method that estimates the time to collision using a neuromorphic event-based camera.
The proposed algorithm consists of a two-step approach for efficient and accurate geometric model fitting on event data.
Experiments on both synthetic and real data demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2024-07-10T02:37:36Z) - DuEqNet: Dual-Equivariance Network in Outdoor 3D Object Detection for
Autonomous Driving [4.489333751818157]
We propose DuEqNet, which first introduces the concept of equivariance into 3D object detection networks.
The dual equivariance of our model can extract equivariant features at both local and global levels.
Our model achieves higher orientation accuracy and better prediction efficiency.
arXiv Detail & Related papers (2023-02-27T08:30:02Z) - Hierarchical Temporal Transformer for 3D Hand Pose Estimation and Action
Recognition from Egocentric RGB Videos [50.74218823358754]
We develop a transformer-based framework to exploit temporal information for robust estimation.
We build a network hierarchy with two cascaded transformer encoders, where the first one exploits the short-term temporal cue for hand pose estimation.
Our approach achieves competitive results on two first-person hand action benchmarks, namely FPHA and H2O.
arXiv Detail & Related papers (2022-09-20T05:52:54Z) - Unsupervised Foggy Scene Understanding via Self Spatial-Temporal Label
Diffusion [51.11295961195151]
We exploit the characteristics of the foggy image sequence of driving scenes to densify the confident pseudo labels.
Based on the two discoveries of local spatial similarity and adjacent temporal correspondence of the sequential image data, we propose a novel Target-Domain driven pseudo label Diffusion scheme.
Our scheme helps the adaptive model achieve 51.92% and 53.84% mean intersection-over-union (mIoU) on two publicly available natural foggy datasets.
arXiv Detail & Related papers (2022-06-10T05:16:50Z) - Joint Spatial-Temporal and Appearance Modeling with Transformer for
Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z) - LocATe: End-to-end Localization of Actions in 3D with Transformers [91.28982770522329]
LocATe is an end-to-end approach that jointly localizes and recognizes actions in a 3D sequence.
Unlike transformer-based object-detection and classification models which consider image or patch features as input, LocATe's transformer model is capable of capturing long-term correlations between actions in a sequence.
We introduce a new, challenging, and more realistic benchmark dataset, BABEL-TAL-20 (BT20), where the performance of state-of-the-art methods is significantly worse.
arXiv Detail & Related papers (2022-03-21T03:35:32Z) - 2nd Place Solution for Waymo Open Dataset Challenge - Real-time 2D
Object Detection [26.086623067939605]
In this report, we introduce a real-time method to detect the 2D objects from images.
We leverage accelerationRT to optimize the inference time of our detection pipeline.
Our framework achieves the latency of 45.8ms/frame on an Nvidia Tesla V100 GPU.
arXiv Detail & Related papers (2021-06-16T11:32:03Z) - TransCamP: Graph Transformer for 6-DoF Camera Pose Estimation [77.09542018140823]
We propose a neural network approach with a graph transformer backbone, namely TransCamP, to address the camera relocalization problem.
TransCamP effectively fuses the image features, camera pose information and inter-frame relative camera motions into encoded graph attributes.
arXiv Detail & Related papers (2021-05-28T19:08:43Z) - A Driving Behavior Recognition Model with Bi-LSTM and Multi-Scale CNN [59.57221522897815]
We propose a neural network model based on trajectories information for driving behavior recognition.
We evaluate the proposed model on the public BLVD dataset, achieving satisfactory performance.
arXiv Detail & Related papers (2021-03-01T06:47:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.