Transformer-based Fusion of 2D-pose and Spatio-temporal Embeddings for
Distracted Driver Action Recognition
- URL: http://arxiv.org/abs/2403.06577v1
- Date: Mon, 11 Mar 2024 10:26:38 GMT
- Title: Transformer-based Fusion of 2D-pose and Spatio-temporal Embeddings for
Distracted Driver Action Recognition
- Authors: Erkut Akdag, Zeqi Zhu, Egor Bondarev, Peter H. N. De With
- Abstract summary: Temporal localization of driving actions over time is important for advanced driver-assistance systems and naturalistic driving studies.
We aim to improve temporal localization and classification accuracy by adapting video action recognition and 2D human-pose estimation networks into one model.
The model performs well on the A2 test set of the 2023 NVIDIA AI City Challenge for naturalistic driving action recognition.
- Score: 8.841708075914353
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Classification and localization of driving actions over time is important for
advanced driver-assistance systems and naturalistic driving studies. Temporal
localization is challenging because it requires robustness, reliability, and
accuracy. In this study, we aim to improve temporal localization and
classification accuracy by adapting video action recognition and 2D
human-pose estimation networks into one model. Therefore, we design a
transformer-based fusion architecture to effectively combine 2D-pose features
and spatio-temporal features. The model uses 2D-pose features as the positional
embedding of the transformer architecture and spatio-temporal features as the
main input to the transformer encoder. The proposed solution is generic
and independent of the number and positions of the cameras, giving frame-based
class probabilities as output. Finally, a post-processing step combines
information from the different camera views to obtain final predictions and
eliminate false positives. The model performs well on the A2 test set of the
2023 NVIDIA AI City Challenge for naturalistic driving action recognition,
achieving an overlap score of 0.5079 on the organizer-defined distracted
driver behaviour metric.
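The fusion idea in the abstract — 2D-pose features standing in for the usual positional embedding, added to spatio-temporal tokens before the transformer encoder — can be sketched as follows. This is a minimal illustration with random features and a single self-attention layer; all dimensions, weights, and the class count are hypothetical placeholders, not values from the paper.

```python
import numpy as np

# Hypothetical sizes (not from the paper): T frames, model width D.
T, D = 8, 16
rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Spatio-temporal features (e.g. from a video backbone): one token per frame.
st_tokens = rng.normal(size=(T, D))
# 2D-pose features projected to width D; used *as* the positional embedding
# instead of fixed sinusoids, per the fusion design described in the abstract.
pose_emb = rng.normal(size=(T, D))

# Fusion: pose features act as the positional embedding of the encoder input.
x = st_tokens + pose_emb

# One single-head self-attention layer as a stand-in for the full encoder.
Wq, Wk, Wv = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
attn = softmax(q @ k.T / np.sqrt(D))   # (T, T) attention over frames
encoded = attn @ v                     # (T, D) encoded frame tokens

# Frame-based class probabilities, matching the model's per-frame output.
n_classes = 4  # hypothetical
Wc = rng.normal(size=(D, n_classes)) * 0.1
probs = softmax(encoded @ Wc)
print(probs.shape)  # (8, 4): one class distribution per frame
```

In the actual system, these per-frame probabilities from multiple camera views are then merged in a post-processing step to produce the final action intervals and suppress false positives.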
Related papers
- CorrDiff: Adaptive Delay-aware Detector with Temporal Cue Inputs for Real-time Object Detection [11.714072240331518]
CorrDiff is designed to tackle the challenge of delays in real-time detection systems.
It is able to utilize runtime-estimated temporal cues to predict objects' locations for multiple future frames.
It meets the stringent real-time processing requirements on all kinds of devices.
arXiv Detail & Related papers (2025-01-09T10:34:25Z)
- Bench2Drive-R: Turning Real World Data into Reactive Closed-Loop Autonomous Driving Benchmark by Generative Model [63.336123527432136]
We introduce Bench2Drive-R, a generative framework that enables reactive closed-loop evaluation.
Unlike existing video generative models for autonomous driving, the proposed designs are tailored for interactive simulation.
We compare the generation quality of Bench2Drive-R with existing generative models and achieve state-of-the-art performance.
arXiv Detail & Related papers (2024-12-11T06:35:18Z) - Event-Based Tracking Any Point with Motion-Augmented Temporal Consistency [58.719310295870024]
This paper presents an event-based framework for tracking any point.
It tackles the challenges posed by spatial sparsity and motion sensitivity in events.
It achieves 150% faster processing with competitive model parameters.
arXiv Detail & Related papers (2024-12-02T09:13:29Z) - GTransPDM: A Graph-embedded Transformer with Positional Decoupling for Pedestrian Crossing Intention Prediction [6.327758022051579]
GTransPDM was developed for pedestrian crossing intention prediction by leveraging multi-modal features.
It achieves 92% accuracy on the PIE dataset and 87% accuracy on the JAAD dataset, with a processing speed of 0.05ms.
arXiv Detail & Related papers (2024-09-30T12:02:17Z)
- DuEqNet: Dual-Equivariance Network in Outdoor 3D Object Detection for Autonomous Driving [4.489333751818157]
We propose DuEqNet, which first introduces the concept of equivariance into 3D object detection network.
The dual-equivariant of our model can extract the equivariant features at both local and global levels.
Our model presents higher accuracy on orientation and better prediction efficiency.
arXiv Detail & Related papers (2023-02-27T08:30:02Z)
- Hierarchical Temporal Transformer for 3D Hand Pose Estimation and Action Recognition from Egocentric RGB Videos [50.74218823358754]
We develop a transformer-based framework to exploit temporal information for robust estimation.
We build a network hierarchy with two cascaded transformer encoders, where the first one exploits the short-term temporal cue for hand pose estimation.
Our approach achieves competitive results on two first-person hand action benchmarks, namely FPHA and H2O.
arXiv Detail & Related papers (2022-09-20T05:52:54Z)
- Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z)
- LocATe: End-to-end Localization of Actions in 3D with Transformers [91.28982770522329]
LocATe is an end-to-end approach that jointly localizes and recognizes actions in a 3D sequence.
Unlike transformer-based object-detection and classification models which consider image or patch features as input, LocATe's transformer model is capable of capturing long-term correlations between actions in a sequence.
We introduce a new, challenging, and more realistic benchmark dataset, BABEL-TAL-20 (BT20), where the performance of state-of-the-art methods is significantly worse.
arXiv Detail & Related papers (2022-03-21T03:35:32Z)
- 2nd Place Solution for Waymo Open Dataset Challenge - Real-time 2D Object Detection [26.086623067939605]
In this report, we introduce a real-time method to detect the 2D objects from images.
We leverage accelerationRT to optimize the inference time of our detection pipeline.
Our framework achieves the latency of 45.8ms/frame on an Nvidia Tesla V100 GPU.
arXiv Detail & Related papers (2021-06-16T11:32:03Z)
- TransCamP: Graph Transformer for 6-DoF Camera Pose Estimation [77.09542018140823]
We propose a neural network approach with a graph transformer backbone, namely TransCamP, to address the camera relocalization problem.
TransCamP effectively fuses the image features, camera pose information and inter-frame relative camera motions into encoded graph attributes.
arXiv Detail & Related papers (2021-05-28T19:08:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.