Event-based Human Pose Tracking by Spiking Spatiotemporal Transformer
- URL: http://arxiv.org/abs/2303.09681v4
- Date: Wed, 6 Sep 2023 21:34:59 GMT
- Title: Event-based Human Pose Tracking by Spiking Spatiotemporal Transformer
- Authors: Shihao Zou, Yuxuan Mu, Xinxin Zuo, Sen Wang, Li Cheng
- Abstract summary: We present a dedicated end-to-end sparse deep learning approach for event-based pose tracking.
This is the first time that 3D human pose tracking is obtained from events only.
Our approach also achieves a significant reduction of 80% in FLOPS.
- Score: 20.188995900488717
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The event camera, an emerging biologically inspired vision sensor for
capturing motion dynamics, presents new potential for 3D human pose tracking,
i.e., video-based 3D human pose estimation. However, existing works in pose
tracking either require additional gray-scale images to establish a solid
starting pose, or ignore temporal dependencies altogether by collapsing
segments of event streams into static event frames.
Meanwhile, although the effectiveness of Artificial Neural Networks (ANNs,
a.k.a. dense deep learning) has been showcased in many event-based tasks, ANNs
tend to neglect the fact that, compared to dense frame-based image sequences,
the events produced by an event camera are spatiotemporally much sparser.
Motivated by these issues, we present in this paper a dedicated end-to-end
sparse deep learning approach for event-based pose tracking: 1) to our
knowledge, this is the first time that 3D human pose tracking is obtained from
events only, eliminating the need to access any frame-based images as part of
the input; 2) our approach is based
entirely upon the framework of Spiking Neural Networks (SNNs), which consists
of Spike-Element-Wise (SEW) ResNet and a novel Spiking Spatiotemporal
Transformer; 3) a large-scale synthetic dataset, named SynEventHPD, is
constructed that features a broad and diverse set of annotated 3D human motions
as well as longer hours of event-stream data. Empirical experiments demonstrate
that, besides delivering superior performance over state-of-the-art (SOTA) ANN
counterparts, our approach achieves a significant computation reduction of 80%
in FLOPS.
Furthermore, our proposed method also outperforms SOTA SNNs in the regression
task of human pose tracking. Our implementation is available at
https://github.com/JimmyZou/HumanPoseTracking_SNN, and the dataset will be
released upon paper acceptance.
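For orientation, below is a minimal PyTorch sketch of the two generic SNN building blocks named in the abstract: a leaky integrate-and-fire (LIF) neuron trained through a surrogate gradient, and a Spike-Element-Wise (SEW) residual connection using the ADD element-wise function. The tensor layout, the sigmoid surrogate, and all hyperparameters are illustrative assumptions; the authors' actual architecture, including the novel Spiking Spatiotemporal Transformer, lives in the repository linked above.

```python
# Hypothetical simplification of LIF + SEW-ResNet building blocks;
# not the authors' implementation.
import torch
import torch.nn as nn


class SurrogateSpike(torch.autograd.Function):
    """Heaviside spike in the forward pass; sigmoid surrogate gradient
    in the backward pass so the network stays trainable."""

    @staticmethod
    def forward(ctx, v):
        ctx.save_for_backward(v)
        return (v >= 0.0).float()

    @staticmethod
    def backward(ctx, grad_out):
        (v,) = ctx.saved_tensors
        sg = torch.sigmoid(4.0 * v)          # smooth stand-in for the step
        return grad_out * sg * (1.0 - sg) * 4.0


class LIFNeuron(nn.Module):
    """LIF layer: integrates input over T time steps, emits binary
    spikes, and hard-resets the membrane potential on firing."""

    def __init__(self, tau: float = 2.0, v_th: float = 1.0):
        super().__init__()
        self.tau, self.v_th = tau, v_th

    def forward(self, x):                     # x: (T, B, C, H, W)
        v = torch.zeros_like(x[0])
        spikes = []
        for t in range(x.shape[0]):
            v = v + (x[t] - v) / self.tau     # leaky integration
            s = SurrogateSpike.apply(v - self.v_th)
            v = v * (1.0 - s)                 # reset where a spike fired
            spikes.append(s)
        return torch.stack(spikes)            # binary, same shape as x


class SEWBlock(nn.Module):
    """Spike-Element-Wise residual block (ADD variant): conv -> LIF,
    then an element-wise ADD between input and output spike trains."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.lif = LIFNeuron()

    def forward(self, s):                     # s: (T, B, C, H, W) spikes
        T, B, C, H, W = s.shape
        y = self.conv(s.reshape(T * B, C, H, W)).reshape(T, B, C, H, W)
        return self.lif(y) + s                # SEW "ADD": spike counts sum


# Example: T=4 time steps, batch 2, 8 channels, 16x16 event frames.
block = SEWBlock(8)
out = block(torch.rand(4, 2, 8, 16, 16).gt(0.9).float())  # sparse binary input
```

In the SEW formulation, the element-wise ADD keeps an identity path in spike form, which is what makes deep residual SNNs trainable at depth; note the summed output can reach 2, so it is no longer strictly binary.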
Related papers
- MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion [118.74385965694694]
We present Motion DUSt3R (MonST3R), a novel geometry-first approach that directly estimates per-timestep geometry from dynamic scenes.
By simply estimating a pointmap for each timestep, we can effectively adapt DUSt3R's representation, previously only used for static scenes, to dynamic scenes.
We show that by posing the problem as a fine-tuning task, identifying several suitable datasets, and strategically training the model on this limited data, we can surprisingly enable the model to handle dynamics.
arXiv Detail & Related papers (2024-10-04T18:00:07Z) - 3D Human Scan With A Moving Event Camera [7.734104968315144]
Event cameras have the advantages of high temporal resolution and high dynamic range.
This paper proposes a novel event-based method for 3D pose estimation and human mesh recovery.
arXiv Detail & Related papers (2024-04-12T14:34:24Z) - Towards Precise 3D Human Pose Estimation with Multi-Perspective Spatial-Temporal Relational Transformers [28.38686299271394]
We propose a framework for 3D sequence-to-sequence (seq2seq) human pose detection.
The spatial module represents the human pose feature through intra-image content, while the frame-image relation module extracts temporal relationships across frames.
Our method is evaluated on Human3.6M, a popular 3D human pose detection dataset.
arXiv Detail & Related papers (2024-01-30T03:00:25Z) - Social-Transmotion: Promptable Human Trajectory Prediction [65.80068316170613]
Social-Transmotion is a generic Transformer-based model that exploits diverse and numerous visual cues to predict human behavior.
Our approach is validated on multiple datasets, including JTA, JRDB, Pedestrians and Cyclists in Road Traffic, and ETH-UCY.
arXiv Detail & Related papers (2023-12-26T18:56:49Z) - EventTransAct: A video transformer-based framework for Event-camera based action recognition [52.537021302246664]
Event cameras offer new opportunities for action recognition compared to standard RGB videos.
In this study, we employ a computationally efficient model, namely the video transformer network (VTN), which initially acquires spatial embeddings per event-frame.
To better adapt the VTN to the sparse and fine-grained nature of event data, we design an Event-Contrastive Loss ($\mathcal{L}_{EC}$) and event-specific augmentations.
arXiv Detail & Related papers (2023-08-25T23:51:07Z) - A Temporal Densely Connected Recurrent Network for Event-based Human Pose Estimation [24.367222637492787]
Event cameras are emerging bio-inspired vision sensors that report per-pixel brightness changes asynchronously.
This paper proposes a novel densely connected recurrent architecture to address the problem of incomplete information.
With this recurrent architecture, we can explicitly model not only sequential but also non-sequential geometric consistency across time steps.
arXiv Detail & Related papers (2022-09-15T04:08:18Z) - EventHPE: Event-based 3D Human Pose and Shape Estimation [33.197194879047956]
The event camera is an emerging imaging sensor that captures the dynamics of moving objects as events.
We propose a two-stage deep learning approach, called EventHPE.
The first stage, FlowNet, is trained by unsupervised learning to infer optical flow from events.
Both the events and the inferred optical flow are then fed as input to the second stage, ShapeNet, to estimate 3D human shapes.
arXiv Detail & Related papers (2021-08-15T21:40:19Z) - Multi-level Motion Attention for Human Motion Prediction [132.29963836262394]
We study the use of different types of attention, computed at joint, body part, and full pose levels.
Our experiments on Human3.6M, AMASS and 3DPW validate the benefits of our approach for both periodic and non-periodic actions.
arXiv Detail & Related papers (2021-06-17T08:08:11Z) - Learning Dynamics via Graph Neural Networks for Human Pose Estimation and Tracking [98.91894395941766]
We propose a novel online approach to learning the pose dynamics, which are independent of pose detections in the current frame.
Specifically, we derive this prediction of dynamics through a graph neural network (GNN) that explicitly accounts for both spatial-temporal and visual information.
Experiments on PoseTrack 2017 and PoseTrack 2018 datasets demonstrate that the proposed method achieves results superior to the state of the art on both human pose estimation and tracking tasks.
arXiv Detail & Related papers (2021-06-07T16:36:50Z) - Multi-Scale Networks for 3D Human Pose Estimation with Inference Stage Optimization [33.02708860641971]
Estimating 3D human poses from a monocular video is still a challenging task.
The performance of many existing methods drops when the target person is occluded by other objects, or when the motion is too fast or too slow relative to the scale and speed of the training data.
We introduce a spatio-temporal network for robust 3D human pose estimation.
arXiv Detail & Related papers (2020-10-13T15:24:28Z) - Self-Supervised 3D Human Pose Estimation via Part Guided Novel Image Synthesis [72.34794624243281]
We propose a self-supervised learning framework to disentangle variations from unlabeled video frames.
Our differentiable formalization, bridging the representation gap between the 3D pose and spatial part maps, allows us to operate on videos with diverse camera movements.
arXiv Detail & Related papers (2020-04-09T07:55:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.