A Spatio-Temporal Multilayer Perceptron for Gesture Recognition
- URL: http://arxiv.org/abs/2204.11511v1
- Date: Mon, 25 Apr 2022 08:42:47 GMT
- Title: A Spatio-Temporal Multilayer Perceptron for Gesture Recognition
- Authors: Adrian Holzbock, Alexander Tsaregorodtsev, Youssef Dawoud, Klaus
Dietmayer, Vasileios Belagiannis
- Abstract summary: We propose a multilayer state-weighted perceptron for gesture recognition in the context of autonomous vehicles.
An evaluation on the TCG and Drive&Act datasets showcases the promising performance of our approach.
We deploy our model to our autonomous vehicle to show its real-time capability and stable execution.
- Score: 70.34489104710366
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Gesture recognition is essential for the interaction of autonomous vehicles
with humans. While the current approaches focus on combining several modalities
like image features, keypoints and bone vectors, we present a neural network
architecture that delivers state-of-the-art results only with body skeleton
input data. We propose the spatio-temporal multilayer perceptron for gesture
recognition in the context of autonomous vehicles. Given 3D body poses over
time, we define temporal and spatial mixing operations to extract features in
both domains. Additionally, the importance of each time step is re-weighted
with Squeeze-and-Excitation layers. An extensive evaluation on the TCG and
Drive&Act datasets showcases the promising performance of our
approach. Furthermore, we deploy our model to our autonomous vehicle to show
its real-time capability and stable execution.
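The architecture described in the abstract — temporal and spatial mixing operations over 3D poses, with Squeeze-and-Excitation re-weighting of time steps — can be sketched roughly as follows. This is a minimal numpy illustration, not the paper's implementation: the dimensions, the ReLU non-linearity, the residual connections, and the SE bottleneck ratio are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w1, w2):
    # Two-layer perceptron with a ReLU non-linearity (illustrative choice).
    return np.maximum(x @ w1, 0.0) @ w2

def se_reweight(x, w1, w2):
    # Squeeze-and-Excitation over time steps: average-pool the features
    # of each step ("squeeze"), pass through a small bottleneck MLP with
    # a sigmoid gate ("excite"), then scale each time step by its gate.
    s = x.mean(axis=-1)                                       # (T,)
    g = 1.0 / (1.0 + np.exp(-(np.maximum(s @ w1, 0.0) @ w2)))  # (T,)
    return x * g[:, None]

T, D, H = 16, 51, 64        # 16 frames, 17 joints x 3 coords, hidden width
x = rng.standard_normal((T, D))   # flattened 3D body poses over time

# Temporal mixing: an MLP applied along the time axis, per feature.
wt1, wt2 = rng.standard_normal((T, H)), rng.standard_normal((H, T))
x = x + mlp(x.T, wt1, wt2).T      # residual temporal mixing

# Spatial mixing: an MLP applied along the joint/feature axis, per frame.
ws1, ws2 = rng.standard_normal((D, H)), rng.standard_normal((H, D))
x = x + mlp(x, ws1, ws2)          # residual spatial mixing

# Re-weight the importance of each time step with the SE gate.
we1, we2 = rng.standard_normal((T, T // 4)), rng.standard_normal((T // 4, T))
x = se_reweight(x, we1, we2)

print(x.shape)  # (16, 51): same shape, per-step importance re-weighted
```

In practice such blocks would be stacked and followed by a classification head; the sketch only shows how the two mixing directions and the per-time-step gating compose.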
Related papers
- Cohere3D: Exploiting Temporal Coherence for Unsupervised Representation
Learning of Vision-based Autonomous Driving [73.3702076688159]
We propose a novel contrastive learning algorithm, Cohere3D, to learn coherent instance representations in a long-term input sequence.
We evaluate our algorithm by finetuning the pretrained model on various downstream perception, prediction, and planning tasks.
arXiv Detail & Related papers (2024-02-23T19:43:01Z)
- Trajeglish: Traffic Modeling as Next-Token Prediction [67.28197954427638]
A longstanding challenge for self-driving development is simulating dynamic driving scenarios seeded from recorded driving logs.
We apply tools from discrete sequence modeling to model how vehicles, pedestrians and cyclists interact in driving scenarios.
Our model tops the Sim Agents Benchmark, surpassing prior work along the realism meta metric by 3.3% and along the interaction metric by 9.9%.
arXiv Detail & Related papers (2023-12-07T18:53:27Z)
- Gesture Recognition with Keypoint and Radar Stream Fusion for Automated Vehicles [13.652770928249447]
We present a joint camera and radar approach to enable autonomous vehicles to understand and react to human gestures in everyday traffic.
We propose a fusion neural network for both modalities, including an auxiliary loss for each modality.
Motivated by adverse weather conditions, we also demonstrate promising performance when one of the sensors lacks functionality.
arXiv Detail & Related papers (2023-02-20T14:18:11Z)
- ST-P3: End-to-end Vision-based Autonomous Driving via Spatial-Temporal Feature Learning [132.20119288212376]
We propose a spatial-temporal feature learning scheme towards a set of more representative features for perception, prediction and planning tasks simultaneously.
To the best of our knowledge, we are the first to systematically investigate each part of an interpretable end-to-end vision-based autonomous driving system.
arXiv Detail & Related papers (2022-07-15T16:57:43Z)
- Spatio-Temporal Self-Attention Network for Video Saliency Prediction [13.873682190242365]
3D convolutional neural networks have achieved promising results for video tasks in computer vision.
We propose a novel Spatio-Temporal Self-Attention Network (STSANet) for video saliency prediction.
arXiv Detail & Related papers (2021-08-24T12:52:47Z)
- TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild [77.59069361196404]
TRiPOD is a novel method for predicting body dynamics based on graph attentional networks.
To incorporate a real-world challenge, we learn an indicator representing whether an estimated body joint is visible/invisible at each frame.
Our evaluation shows that TRiPOD outperforms all prior work and state-of-the-art specifically designed for each of the trajectory and pose forecasting tasks.
arXiv Detail & Related papers (2021-04-08T20:01:00Z)
- OpenPifPaf: Composite Fields for Semantic Keypoint Detection and Spatio-Temporal Association [90.39247595214998]
Image-based perception tasks can be formulated as detecting and associating semantic keypoints, e.g., human body pose estimation and tracking.
We present a general framework that jointly detects and forms spatio-temporal keypoint associations in a single stage.
We also show that our method generalizes to any class of keypoints such as car and animal parts to provide a holistic perception framework.
arXiv Detail & Related papers (2021-03-03T14:44:14Z)
- Attention-Driven Body Pose Encoding for Human Activity Recognition [0.0]
This article proposes a novel attention-based body pose encoding for human activity recognition.
The enriched data complements the 3D body joint position data and improves model performance.
arXiv Detail & Related papers (2020-09-29T22:17:17Z)
- Gesture Recognition from Skeleton Data for Intuitive Human-Machine Interaction [0.6875312133832077]
We propose an approach for segmentation and classification of dynamic gestures based on a set of handcrafted features.
The method for gesture recognition applies a sliding window, which extracts information from both the spatial and temporal dimensions.
At the end, the recognized gestures are used to interact with a collaborative robot.
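The sliding-window scheme this entry describes — extracting fixed-length spans of a skeleton sequence so each span can be classified independently — can be sketched as follows. The window size and stride are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sliding_windows(frames, win, stride):
    # Yield fixed-length windows over a per-frame skeleton sequence;
    # each window spans both the spatial (keypoint) and temporal axes.
    for start in range(0, len(frames) - win + 1, stride):
        yield frames[start:start + win]

seq = np.arange(10 * 2).reshape(10, 2)   # 10 frames of toy 2-D keypoints
wins = list(sliding_windows(seq, win=4, stride=2))
print(len(wins))   # 4 windows, starting at frames 0, 2, 4, 6
```

A classifier would then label each window, and consecutive labels would be merged to segment and recognize gestures.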
arXiv Detail & Related papers (2020-08-26T11:28:50Z)
- A Graph Attention Spatio-temporal Convolutional Network for 3D Human Pose Estimation in Video [7.647599484103065]
We improve the learning of constraints in the human skeleton by modeling local and global spatial information via attention mechanisms.
Our approach effectively mitigates depth ambiguity and self-occlusion, generalizes to half upper body estimation, and achieves competitive performance on 2D-to-3D video pose estimation.
arXiv Detail & Related papers (2020-03-11T14:54:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the generated summaries (including all information) and is not responsible for any consequences.