Pose-guided multi-task video transformer for driver action recognition
- URL: http://arxiv.org/abs/2407.13750v1
- Date: Thu, 18 Jul 2024 17:53:51 GMT
- Title: Pose-guided multi-task video transformer for driver action recognition
- Authors: Ricardo Pizarro, Roberto Valle, Luis Miguel Bergasa, José M. Buenaposada, Luis Baumela,
- Abstract summary: We introduce a multi-task video transformer that predicts both distracted actions and driver pose.
Our model surpasses existing state-of-the art results in driver action recognition while exhibiting superior efficiency compared to current video transformer-based approaches.
- Score: 5.392333248149453
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We investigate the task of identifying situations of distracted driving through analysis of in-car videos. To tackle this challenge we introduce a multi-task video transformer that predicts both distracted actions and driver pose. Leveraging VideoMAEv2, a large pre-trained architecture, our approach incorporates semantic information from human keypoint locations to enhance action recognition and decrease computational overhead by minimizing the number of spatio-temporal tokens. By guiding token selection with pose and class information, we notably reduce the model's computational requirements while preserving the baseline accuracy. Our model surpasses existing state-of-the art results in driver action recognition while exhibiting superior efficiency compared to current video transformer-based approaches.
Related papers
- OneTrack-M: A multitask approach to transformer-based MOT models [0.0]
Multi-Object Tracking (MOT) is a critical problem in computer vision.
OneTrack-M is a transformer-based MOT model designed to enhance tracking computational efficiency and accuracy.
arXiv Detail & Related papers (2025-02-06T20:02:06Z) - An object detection approach for lane change and overtake detection from motion profiles [3.545178658731506]
In this paper, we address the identification of overtake and lane change maneuvers with a novel object detection approach applied to motion profiles.
To train and test our model we created an internal dataset of motion profile images obtained from a heterogeneous set of dashcam videos.
In addition to a standard object-detection approach, we show how the inclusion of CoordConvolution layers further improves the model performance.
arXiv Detail & Related papers (2025-02-06T17:36:35Z) - Multi-view Action Recognition via Directed Gromov-Wasserstein Discrepancy [12.257725479880458]
Action recognition has become one of the popular research topics in computer vision.
We propose a multi-view attention consistency method that computes the similarity between two attentions from two different views of the action videos.
Our approach applies the idea of Neural Radiance Field to implicitly render the features from novel views when training on single-view datasets.
arXiv Detail & Related papers (2024-05-02T14:43:21Z) - Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z) - M$^2$DAR: Multi-View Multi-Scale Driver Action Recognition with Vision
Transformer [5.082919518353888]
We present a multi-view, multi-scale framework for naturalistic driving action recognition and localization in untrimmed videos.
Our system features a weight-sharing, multi-scale Transformer-based action recognition network that learns robust hierarchical representations.
arXiv Detail & Related papers (2023-05-13T02:38:15Z) - DOAD: Decoupled One Stage Action Detection Network [77.14883592642782]
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding.
Existing methods are mostly two-stage based, with one stage for person bounding box generation and the other stage for action recognition.
We present a decoupled one-stage network dubbed DOAD, to improve the efficiency for-temporal action detection.
arXiv Detail & Related papers (2023-04-01T08:06:43Z) - Visual Exemplar Driven Task-Prompting for Unified Perception in
Autonomous Driving [100.3848723827869]
We present an effective multi-task framework, VE-Prompt, which introduces visual exemplars via task-specific prompting.
Specifically, we generate visual exemplars based on bounding boxes and color-based markers, which provide accurate visual appearances of target categories.
We bridge transformer-based encoders and convolutional layers for efficient and accurate unified perception in autonomous driving.
arXiv Detail & Related papers (2023-03-03T08:54:06Z) - Multi-dataset Training of Transformers for Robust Action Recognition [75.5695991766902]
We study the task of robust feature representations, aiming to generalize well on multiple datasets for action recognition.
Here, we propose a novel multi-dataset training paradigm, MultiTrain, with the design of two new loss terms, namely informative loss and projection loss.
We verify the effectiveness of our method on five challenging datasets, Kinetics-400, Kinetics-700, Moments-in-Time, Activitynet and Something-something-v2.
arXiv Detail & Related papers (2022-09-26T01:30:43Z) - Differentiable Frequency-based Disentanglement for Aerial Video Action
Recognition [56.91538445510214]
We present a learning algorithm for human activity recognition in videos.
Our approach is designed for UAV videos, which are mainly acquired from obliquely placed dynamic cameras.
We conduct extensive experiments on the UAV Human dataset and the NEC Drone dataset.
arXiv Detail & Related papers (2022-09-15T22:16:52Z) - Depth Guided Adaptive Meta-Fusion Network for Few-shot Video Recognition [86.31412529187243]
Few-shot video recognition aims at learning new actions with only very few labeled samples.
We propose a depth guided Adaptive Meta-Fusion Network for few-shot video recognition which is termed as AMeFu-Net.
arXiv Detail & Related papers (2020-10-20T03:06:20Z) - Vehicle-Human Interactive Behaviors in Emergency: Data Extraction from
Traffic Accident Videos [0.0]
Currently, studying the vehicle-human interactive behavior in the emergency needs a large amount of datasets in the actual emergent situations that are almost unavailable.
This paper provides a new yet convenient way to extract the interactive behavior data (i.e., the trajectories of vehicles and humans) from actual accident videos.
The main challenge for data extraction from real-time accident video lies in the fact that the recording cameras are un-calibrated and the angles of surveillance are unknown.
arXiv Detail & Related papers (2020-03-02T22:17:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.