Density-Guided Label Smoothing for Temporal Localization of Driving Actions
- URL: http://arxiv.org/abs/2403.06616v1
- Date: Mon, 11 Mar 2024 11:06:41 GMT
- Title: Density-Guided Label Smoothing for Temporal Localization of Driving Actions
- Authors: Tunc Alkanat, Erkut Akdag, Egor Bondarev, Peter H. N. De With
- Abstract summary: We focus on improving the overall performance by efficiently utilizing video action recognition networks.
We design a post-processing step to efficiently fuse information from video-segments and multiple camera views into scene-level predictions.
Our methodology yields competitive performance on the A2 test set of the naturalistic driving action recognition track of the 2022 NVIDIA AI City Challenge, with an F1 score of 0.271.
- Score: 8.841708075914353
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Temporal localization of driving actions plays a crucial role in advanced
driver-assistance systems and naturalistic driving studies. However, this is a
challenging task due to strict requirements for robustness, reliability and
accurate localization. In this work, we focus on improving the overall
performance by efficiently utilizing video action recognition networks and
adapting these to the problem of action localization. To this end, we first
develop a density-guided label smoothing technique based on label probability
distributions to facilitate better learning from boundary video-segments that
typically include multiple labels. Second, we design a post-processing step to
efficiently fuse information from video-segments and multiple camera views into
scene-level predictions, which facilitates the elimination of false
positives. Our methodology yields competitive performance on the A2 test set
of the naturalistic driving action recognition track of the 2022 NVIDIA AI
City Challenge, with an F1 score of 0.271.
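The abstract describes the two contributions only at a high level, so the following is a minimal sketch of both ideas under stated assumptions: the soft target for a boundary segment is made proportional to each action's frame share within that segment, and scene-level fusion averages per-segment class probabilities over camera views before thresholding. The function names, the mean fusion, and the threshold rule are illustrative choices, not the authors' exact formulation.

```python
import numpy as np

def soft_labels_from_overlap(frame_labels, num_classes):
    """Density-guided soft target for one video segment (hypothetical).

    Instead of a one-hot target, each class receives mass proportional
    to the fraction of the segment's frames that carry its label, so a
    boundary segment spanning two actions gets a soft, two-mode target
    rather than a hard (and partly wrong) one.
    """
    counts = np.bincount(np.asarray(frame_labels), minlength=num_classes)
    return counts / counts.sum()

def fuse_views(view_probs, threshold=0.5):
    """Hypothetical scene-level fusion: average per-segment class
    probabilities over camera views and keep only segments whose best
    class clears a confidence threshold (pruning false positives).

    view_probs: array of shape (views, segments, classes).
    Returns (segment_index, class_index) detections.
    """
    mean_probs = np.asarray(view_probs).mean(axis=0)  # (segments, classes)
    keep = mean_probs.max(axis=1) >= threshold
    best = mean_probs.argmax(axis=1)
    return [(s, int(best[s])) for s in np.flatnonzero(keep)]

# A boundary segment whose first 10 frames are class 3 and last 22 are
# class 7 gets target mass ~0.31 on class 3 and ~0.69 on class 7.
print(soft_labels_from_overlap([3] * 10 + [7] * 22, num_classes=16))
```

In a real pipeline the fusion rule (mean versus max over views) and the confidence threshold would be tuned on held-out labeled data.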
Related papers
- DSDFormer: An Innovative Transformer-Mamba Framework for Robust High-Precision Driver Distraction Identification [23.05821759499963]
Driver distraction remains a leading cause of traffic accidents, posing a critical threat to road safety globally.
We propose DSDFormer, a framework that integrates the strengths of Transformer and Mamba architectures.
We also introduce Temporal Reasoning Confident Learning (TRCL), an unsupervised approach that refines noisy labels by leveraging temporal correlations in video.
arXiv Detail & Related papers (2024-09-09T13:16:15Z)
- FinePseudo: Improving Pseudo-Labelling through Temporal-Alignablity for Semi-Supervised Fine-Grained Action Recognition [57.17966905865054]
Real-life applications of action recognition often require a fine-grained understanding of subtle movements.
Existing semi-supervised approaches have mainly focused on coarse-grained action recognition.
We propose an Alignability-Verification-based Metric learning technique to effectively discriminate between fine-grained action pairs.
arXiv Detail & Related papers (2024-09-02T20:08:06Z)
- DeepLocalization: Using change point detection for Temporal Action Localization [2.4502578110136946]
We introduce DeepLocalization, an innovative framework for real-time action localization, tailored explicitly to monitoring driver behavior.
Our strategy employs a dual approach: leveraging Graph-Based Change-Point Detection for pinpointing actions in time, alongside a Video Large Language Model (Video-LLM) for precisely categorizing activities (a generic change-point sketch appears after this list).
arXiv Detail & Related papers (2024-04-18T15:25:59Z)
- Transformer-based Fusion of 2D-pose and Spatio-temporal Embeddings for Distracted Driver Action Recognition [8.841708075914353]
Temporal localization of driving actions is important for advanced driver-assistance systems and naturalistic driving studies.
We aim to improve temporal localization and classification accuracy by adapting video action recognition and 2D human pose estimation networks into a single model.
The model performs well on the A2 test set of the 2023 NVIDIA AI City Challenge for naturalistic driving action recognition.
arXiv Detail & Related papers (2024-03-11T10:26:38Z)
- Unsupervised Domain Adaptation for Self-Driving from Past Traversal Features [69.47588461101925]
We propose a method to adapt 3D object detectors to new driving environments.
Our approach enhances LiDAR-based detection models using spatially quantized historical features.
Experiments on real-world datasets demonstrate significant improvements.
arXiv Detail & Related papers (2023-09-21T15:00:31Z)
- M$^2$DAR: Multi-View Multi-Scale Driver Action Recognition with Vision Transformer [5.082919518353888]
We present a multi-view, multi-scale framework for naturalistic driving action recognition and localization in untrimmed videos.
Our system features a weight-sharing, multi-scale Transformer-based action recognition network that learns robust hierarchical representations.
arXiv Detail & Related papers (2023-05-13T02:38:15Z)
- Weakly-Supervised Temporal Action Localization by Inferring Salient Snippet-Feature [26.7937345622207]
Weakly-supervised temporal action localization aims to simultaneously locate action regions and identify action categories in untrimmed videos.
Pseudo-label generation is a promising strategy for this challenging problem, but current methods ignore the natural temporal structure of the video.
We propose a novel weakly-supervised temporal action localization method by inferring salient snippet-feature.
arXiv Detail & Related papers (2023-03-22T06:08:34Z)
- E^2TAD: An Energy-Efficient Tracking-based Action Detector [78.90585878925545]
This paper presents a tracking-based solution to accurately and efficiently localize predefined key actions.
It won first place in the UAV-Video Track of the 2021 Low-Power Computer Vision Challenge (LPCVC).
arXiv Detail & Related papers (2022-04-09T07:52:11Z)
- MIST: Multiple Instance Self-Training Framework for Video Anomaly Detection [76.80153360498797]
We develop a multiple instance self-training framework (MIST) to efficiently refine task-specific discriminative representations.
MIST is composed of 1) a multiple instance pseudo label generator, which adapts a sparse continuous sampling strategy to produce more reliable clip-level pseudo labels (see the sampling sketch after this list), and 2) a self-guided attention-boosted feature encoder.
Our method performs comparably to or even better than existing supervised and weakly supervised methods, obtaining a frame-level AUC of 94.83% on ShanghaiTech.
arXiv Detail & Related papers (2021-04-04T15:47:14Z)
- Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation [79.98992138865042]
We present an augmented transformer with adaptive graph network (ATAG) to exploit both long-range and local temporal contexts for TAPG.
Specifically, we enhance the vanilla transformer by equipping it with a snippet actionness loss and a front block, dubbed augmented transformer.
An adaptive graph convolutional network (GCN) is proposed to build local temporal context by mining the position information and difference between adjacent features.
arXiv Detail & Related papers (2021-03-30T02:01:03Z)
- Learning Comprehensive Motion Representation for Action Recognition [124.65403098534266]
2D CNN-based methods are efficient but may yield redundant features due to applying the same 2D convolution kernel to each frame.
Recent efforts attempt to capture motion information by establishing inter-frame connections, but still suffer from a limited temporal receptive field or high latency.
We propose a Channel-wise Motion Enhancement (CME) module to adaptively emphasize the channels related to dynamic information with a channel-wise gate vector (a minimal gate sketch appears after this list).
We also propose a Spatial-wise Motion Enhancement (SME) module to focus on the regions containing the critical moving target, according to the point-to-point similarity between adjacent feature maps.
arXiv Detail & Related papers (2021-03-23T03:06:26Z)
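The DeepLocalization entry above pairs graph-based change-point detection with a Video-LLM; the blurb does not specify the algorithm, so the sketch below substitutes a generic mean-shift change-point heuristic over per-frame embeddings. All names and the z-score rule are assumptions, not the paper's method.

```python
import numpy as np

def change_points(features, window=8, z=2.0):
    """Generic mean-shift change-point heuristic (NOT DeepLocalization's
    graph-based method): score each frame by the distance between the
    mean embeddings of the windows just before and just after it, then
    flag frames whose score exceeds the average score by more than `z`
    standard deviations.

    features: (num_frames, dim) array of per-frame embeddings.
    """
    features = np.asarray(features)
    n = len(features)
    if n < 2 * window + 1:
        return []
    scores = np.zeros(n)
    for t in range(window, n - window):
        left = features[t - window:t].mean(axis=0)
        right = features[t:t + window].mean(axis=0)
        scores[t] = np.linalg.norm(right - left)
    valid = scores[window:n - window]
    cut = valid.mean() + z * valid.std()
    return [t for t in range(window, n - window) if scores[t] > cut]
```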
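The MIST entry mentions a "sparse continuous sampling strategy" for more reliable clip-level pseudo labels. One plausible reading, sketched below under that assumption, is to divide a video into a few evenly spaced regions and draw one contiguous clip from each, keeping coverage sparse across the video but continuous within each clip.

```python
def sparse_continuous_clips(num_frames, num_clips=4, clip_len=16):
    """Divide the frame range into `num_clips` equal regions and take one
    contiguous `clip_len`-frame window from the middle of each region:
    sparse across the video, continuous within each clip.
    Returns a list of (start, end) frame-index pairs.
    """
    region = max(num_frames // num_clips, 1)
    clips = []
    for i in range(num_clips):
        start = i * region + max((region - clip_len) // 2, 0)
        end = min(start + clip_len, num_frames)
        clips.append((start, end))
    return clips

# A 1000-frame video yields four 16-frame clips spaced across it:
# [(117, 133), (367, 383), (617, 633), (867, 883)]
print(sparse_continuous_clips(1000))
```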
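The CME module above emphasizes motion-related channels with a channel-wise gate vector. A minimal PyTorch sketch of that idea follows, deriving the gate from pooled adjacent-frame differences; the bottleneck design and pooling are assumptions, and the published module differs in detail.

```python
import torch
import torch.nn as nn

class ChannelMotionGate(nn.Module):
    """Minimal sketch of a channel-wise motion gate (inspired by, not
    identical to, the paper's CME module): pool the temporal difference
    of adjacent frame features per channel, map it through a small
    bottleneck, and rescale channels with the resulting sigmoid gate.
    """
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                    # x: (batch, time, channels, h, w)
        diff = (x[:, 1:] - x[:, :-1]).abs()  # adjacent-frame differences
        stats = diff.mean(dim=(1, 3, 4))     # per-channel motion statistic
        gate = self.fc(stats)                # (batch, channels) in (0, 1)
        return x * gate[:, None, :, None, None]

x = torch.randn(2, 8, 64, 14, 14)
print(ChannelMotionGate(64)(x).shape)        # torch.Size([2, 8, 64, 14, 14])
```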