A Spatio-Temporal Multilayer Perceptron for Gesture Recognition
- URL: http://arxiv.org/abs/2204.11511v1
- Date: Mon, 25 Apr 2022 08:42:47 GMT
- Title: A Spatio-Temporal Multilayer Perceptron for Gesture Recognition
- Authors: Adrian Holzbock, Alexander Tsaregorodtsev, Youssef Dawoud, Klaus
Dietmayer, Vasileios Belagiannis
- Abstract summary: We propose a multilayer state-weighted perceptron for gesture recognition in the context of autonomous vehicles.
An evaluation on the TCG and Drive&Act datasets showcases the promising performance of our approach.
We deploy our model to our autonomous vehicle to show its real-time capability and stable execution.
- Score: 70.34489104710366
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Gesture recognition is essential for the interaction of autonomous vehicles
with humans. While the current approaches focus on combining several modalities
like image features, keypoints and bone vectors, we present a neural network
architecture that delivers state-of-the-art results only with body skeleton
input data. We propose the spatio-temporal multilayer perceptron for gesture
recognition in the context of autonomous vehicles. Given 3D body poses over
time, we define temporal and spatial mixing operations to extract features in
both domains. Additionally, the importance of each time step is re-weighted
with Squeeze-and-Excitation layers. An extensive evaluation on the TCG and
Drive&Act datasets showcases the promising performance of our
approach. Furthermore, we deploy our model to our autonomous vehicle to show
its real-time capability and stable execution.
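The architecture described in the abstract — temporal and spatial mixing operations over 3D poses, with Squeeze-and-Excitation re-weighting of time steps — can be sketched roughly as follows. This is a minimal numpy illustration, not the paper's implementation: the dimensions, the ReLU non-linearity, the residual connections, and the SE bottleneck ratio are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w1, w2):
    # Two-layer perceptron with a ReLU non-linearity (illustrative choice).
    return np.maximum(x @ w1, 0.0) @ w2

def se_reweight(x, w1, w2):
    # Squeeze-and-Excitation over time steps: average-pool the features
    # of each step ("squeeze"), pass through a small bottleneck MLP with
    # a sigmoid gate ("excite"), then scale each time step by its gate.
    s = x.mean(axis=-1)                                       # (T,)
    g = 1.0 / (1.0 + np.exp(-(np.maximum(s @ w1, 0.0) @ w2)))  # (T,)
    return x * g[:, None]

T, D, H = 16, 51, 64        # 16 frames, 17 joints x 3 coords, hidden width
x = rng.standard_normal((T, D))   # flattened 3D body poses over time

# Temporal mixing: an MLP applied along the time axis, per feature.
wt1, wt2 = rng.standard_normal((T, H)), rng.standard_normal((H, T))
x = x + mlp(x.T, wt1, wt2).T      # residual temporal mixing

# Spatial mixing: an MLP applied along the joint/feature axis, per frame.
ws1, ws2 = rng.standard_normal((D, H)), rng.standard_normal((H, D))
x = x + mlp(x, ws1, ws2)          # residual spatial mixing

# Re-weight the importance of each time step with the SE gate.
we1, we2 = rng.standard_normal((T, T // 4)), rng.standard_normal((T // 4, T))
x = se_reweight(x, we1, we2)

print(x.shape)  # (16, 51): same shape, per-step importance re-weighted
```

In practice such blocks would be stacked and followed by a classification head; the sketch only shows how the two mixing directions and the per-time-step gating compose.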
Related papers
- Cohere3D: Exploiting Temporal Coherence for Unsupervised Representation
Learning of Vision-based Autonomous Driving [73.3702076688159]
We propose a novel contrastive learning algorithm, Cohere3D, to learn coherent instance representations in a long-term input sequence.
We evaluate our algorithm by finetuning the pretrained model on various downstream perception, prediction, and planning tasks.
arXiv Detail & Related papers (2024-02-23T19:43:01Z)
- Trajeglish: Traffic Modeling as Next-Token Prediction [67.28197954427638]
A longstanding challenge for self-driving development is simulating dynamic driving scenarios seeded from recorded driving logs.
We apply tools from discrete sequence modeling to model how vehicles, pedestrians and cyclists interact in driving scenarios.
Our model tops the Sim Agents Benchmark, surpassing prior work along the realism meta metric by 3.3% and along the interaction metric by 9.9%.
arXiv Detail & Related papers (2023-12-07T18:53:27Z)
- Gesture Recognition with Keypoint and Radar Stream Fusion for Automated Vehicles [13.652770928249447]
We present a joint camera and radar approach to enable autonomous vehicles to understand and react to human gestures in everyday traffic.
We propose a fusion neural network for both modalities, including an auxiliary loss for each modality.
Motivated by adverse weather conditions, we also demonstrate promising performance when one of the sensors lacks functionality.
arXiv Detail & Related papers (2023-02-20T14:18:11Z)
- ST-P3: End-to-end Vision-based Autonomous Driving via Spatial-Temporal Feature Learning [132.20119288212376]
We propose a spatial-temporal feature learning scheme towards a set of more representative features for perception, prediction and planning tasks simultaneously.
To the best of our knowledge, we are the first to systematically investigate each part of an interpretable end-to-end vision-based autonomous driving system.
arXiv Detail & Related papers (2022-07-15T16:57:43Z)
- Spatio-Temporal Self-Attention Network for Video Saliency Prediction [13.873682190242365]
3D convolutional neural networks have achieved promising results for video tasks in computer vision.
We propose a novel Spatio-Temporal Self-Attention Network (STSANet) for video saliency prediction.
arXiv Detail & Related papers (2021-08-24T12:52:47Z)
- TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild [77.59069361196404]
TRiPOD is a novel method for predicting body dynamics based on graph attentional networks.
To incorporate a real-world challenge, we learn an indicator representing whether an estimated body joint is visible/invisible at each frame.
Our evaluation shows that TRiPOD outperforms all prior work and state-of-the-art specifically designed for each of the trajectory and pose forecasting tasks.
arXiv Detail & Related papers (2021-04-08T20:01:00Z)
- OpenPifPaf: Composite Fields for Semantic Keypoint Detection and Spatio-Temporal Association [90.39247595214998]
Image-based perception tasks can be formulated as detecting and associating semantic keypoints, e.g., human body pose estimation and tracking.
We present a general framework that jointly detects and forms spatio-temporal keypoint associations in a single stage.
We also show that our method generalizes to any class of keypoints such as car and animal parts to provide a holistic perception framework.
arXiv Detail & Related papers (2021-03-03T14:44:14Z)
- Attention-Driven Body Pose Encoding for Human Activity Recognition [0.0]
This article proposes a novel attention-based body pose encoding for human activity recognition.
The enriched data complements the 3D body joint position data and improves model performance.
arXiv Detail & Related papers (2020-09-29T22:17:17Z)
- Gesture Recognition from Skeleton Data for Intuitive Human-Machine Interaction [0.6875312133832077]
We propose an approach for segmentation and classification of dynamic gestures based on a set of handcrafted features.
The method for gesture recognition applies a sliding window, which extracts information from both the spatial and temporal dimensions.
At the end, the recognized gestures are used to interact with a collaborative robot.
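The sliding-window scheme this entry describes — extracting fixed-length spans of a skeleton sequence so each span can be classified independently — can be sketched as follows. The window size and stride are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sliding_windows(frames, win, stride):
    # Yield fixed-length windows over a per-frame skeleton sequence;
    # each window spans both the spatial (keypoint) and temporal axes.
    for start in range(0, len(frames) - win + 1, stride):
        yield frames[start:start + win]

seq = np.arange(10 * 2).reshape(10, 2)   # 10 frames of toy 2-D keypoints
wins = list(sliding_windows(seq, win=4, stride=2))
print(len(wins))   # 4 windows, starting at frames 0, 2, 4, 6
```

A classifier would then label each window, and consecutive labels would be merged to segment and recognize gestures.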
arXiv Detail & Related papers (2020-08-26T11:28:50Z)
- A Graph Attention Spatio-temporal Convolutional Network for 3D Human Pose Estimation in Video [7.647599484103065]
We improve the learning of constraints in the human skeleton by modeling local and global spatial information via attention mechanisms.
Our approach effectively mitigates depth ambiguity and self-occlusion, generalizes to half upper body estimation, and achieves competitive performance on 2D-to-3D video pose estimation.
arXiv Detail & Related papers (2020-03-11T14:54:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the generated summaries (including all information) and is not responsible for any consequences.