IVT: An End-to-End Instance-guided Video Transformer for 3D Pose
Estimation
- URL: http://arxiv.org/abs/2208.03431v1
- Date: Sat, 6 Aug 2022 02:36:33 GMT
- Title: IVT: An End-to-End Instance-guided Video Transformer for 3D Pose
Estimation
- Authors: Zhongwei Qiu, Qiansheng Yang, Jian Wang, Dongmei Fu
- Abstract summary: Video 3D human pose estimation aims to localize the 3D coordinates of human joints from videos.
IVT enables learning spatiotemporal contextual depth information from visual features and predicting 3D poses directly from video frames.
Experiments on three widely-used 3D pose estimation benchmarks show that the proposed IVT achieves state-of-the-art performances.
- Score: 6.270047084514142
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video 3D human pose estimation aims to localize the 3D coordinates of human
joints from videos. Recent transformer-based approaches focus on capturing the
spatiotemporal information from sequential 2D poses, which cannot model the
contextual depth feature effectively since the visual depth features are lost
in the step of 2D pose estimation. In this paper, we simplify the paradigm into
an end-to-end framework, Instance-guided Video Transformer (IVT), which enables
learning spatiotemporal contextual depth information from visual features
effectively and predicts 3D poses directly from video frames. In particular, we
firstly formulate video frames as a series of instance-guided tokens and each
token is in charge of predicting the 3D pose of a human instance. These tokens
contain body structure information since they are extracted by the guidance of
joint offsets from the human center to the corresponding body joints. Then,
these tokens are sent into IVT for learning spatiotemporal contextual depth. In
addition, we propose a cross-scale instance-guided attention mechanism to
handle the variational scales among multiple persons. Finally, the 3D poses of
each person are decoded from instance-guided tokens by coordinate regression.
Experiments on three widely-used 3D pose estimation benchmarks show that the
proposed IVT achieves state-of-the-art performances.
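As a rough illustration of the pipeline described in the abstract, a minimal PyTorch-style sketch follows. The stand-in backbone, the tensor shapes, the way predicted joint offsets are turned into sampling locations, and the use of a plain transformer encoder in place of the cross-scale instance-guided attention are all assumptions made for brevity, not the authors' implementation.

```python
import torch
import torch.nn as nn


class InstanceGuidedVideoTransformer(nn.Module):
    """Toy sketch: offset-guided instance tokens -> temporal transformer -> coordinate regression."""

    def __init__(self, feat_dim=256, num_joints=17, num_layers=4, num_heads=8):
        super().__init__()
        self.num_joints = num_joints
        # Stand-in per-frame backbone producing a coarse feature map (assumption).
        self.backbone = nn.Conv2d(3, feat_dim, kernel_size=7, stride=4, padding=3)
        # Predicts offsets from a human center to each body joint (2 values per joint).
        self.offset_head = nn.Linear(feat_dim, num_joints * 2)
        # Plain encoder standing in for the spatiotemporal / cross-scale attention of the paper.
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Coordinate regression: 3D coordinates for every joint of an instance token.
        self.pose_head = nn.Linear(feat_dim, num_joints * 3)

    def forward(self, frames, centers):
        # frames:  (B, T, 3, H, W) video clip
        # centers: (B, T, N, 2) integer (x, y) human-center locations on the feature grid
        B, T, _, H, W = frames.shape
        N, J = centers.shape[2], self.num_joints

        feats = self.backbone(frames.flatten(0, 1))            # (B*T, C, h, w)
        _, C, h, w = feats.shape
        feats = feats.view(B, T, C, h, w)

        b = torch.arange(B)[:, None, None]
        t = torch.arange(T)[None, :, None]
        cx = centers[..., 0].clamp(0, w - 1).long()
        cy = centers[..., 1].clamp(0, h - 1).long()
        center_feat = feats[b, t, :, cy, cx]                   # (B, T, N, C)

        # Joint offsets from each center guide where joint features are sampled
        # (rounded to grid indices here; a real implementation would sample differentiably).
        offs = self.offset_head(center_feat).view(B, T, N, J, 2)
        jx = (cx[..., None] + offs[..., 0].round().long()).clamp(0, w - 1)   # (B, T, N, J)
        jy = (cy[..., None] + offs[..., 1].round().long()).clamp(0, h - 1)
        joint_feat = feats[b[..., None], t[..., None], :, jy, jx]            # (B, T, N, J, C)

        # Instance-guided token: center feature fused with offset-guided joint features.
        tokens = center_feat + joint_feat.mean(dim=3)           # (B, T, N, C)

        # Each instance's tokens across time form one sequence for the transformer.
        tokens = tokens.permute(0, 2, 1, 3).reshape(B * N, T, C)
        tokens = self.encoder(tokens)

        # Decode 3D poses per instance and per frame by coordinate regression.
        return self.pose_head(tokens).view(B, N, T, J, 3)
```

This sketch only mirrors the token-then-transformer-then-regression flow; a faithful implementation would use a real visual backbone and the paper's cross-scale instance-guided attention to handle persons at different scales.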
Related papers
- 2D or not 2D: How Does the Dimensionality of Gesture Representation Affect 3D Co-Speech Gesture Generation? [5.408549711581793]
We study the effect of using either 2D or 3D joint coordinates as training data on the performance of speech-to-gesture deep generative models.
We employ a lifting model for converting generated 2D pose sequences into 3D and assess how gestures created directly in 3D stack up against those initially generated in 2D and then converted to 3D.
arXiv Detail & Related papers (2024-09-16T15:06:12Z) - UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues [55.69339788566899]
UPose3D is a novel approach for multi-view 3D human pose estimation.
It improves robustness and flexibility without requiring direct 3D annotations.
arXiv Detail & Related papers (2024-04-23T00:18:00Z) - Cohere3D: Exploiting Temporal Coherence for Unsupervised Representation
Learning of Vision-based Autonomous Driving [73.3702076688159]
We propose a novel contrastive learning algorithm, Cohere3D, to learn coherent instance representations in a long-term input sequence.
We evaluate our algorithm by finetuning the pretrained model on various downstream perception, prediction, and planning tasks.
arXiv Detail & Related papers (2024-02-23T19:43:01Z) - Geometry-Biased Transformer for Robust Multi-View 3D Human Pose
Reconstruction [3.069335774032178]
We propose a novel encoder-decoder Transformer architecture to estimate 3D poses from multi-view 2D pose sequences.
We conduct experiments on three benchmark public datasets, Human3.6M, CMU Panoptic and Occlusion-Persons.
arXiv Detail & Related papers (2023-12-28T16:30:05Z) - 3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features [70.50665869806188]
3DiffTection is a state-of-the-art method for 3D object detection from single images.
We fine-tune a diffusion model to perform novel view synthesis conditioned on a single image.
We further train the model on target data with detection supervision.
arXiv Detail & Related papers (2023-11-07T23:46:41Z) - Co-Evolution of Pose and Mesh for 3D Human Body Estimation from Video [23.93644678238666]
We propose a Pose and Mesh Co-Evolution network (PMCE) to recover 3D human motion from a video.
The proposed PMCE outperforms previous state-of-the-art methods in terms of both per-frame accuracy and temporal consistency.
arXiv Detail & Related papers (2023-08-20T16:03:21Z) - LPFormer: LiDAR Pose Estimation Transformer with Multi-Task Network [12.968650885640127]
Previous methods for 3D human pose estimation have often relied on 2D image features and sequential 2D annotations.
We present the first framework for end-to-end 3D human pose estimation, named LPFormer, which uses only LiDAR as its input.
arXiv Detail & Related papers (2023-06-21T19:20:15Z) - PSVT: End-to-End Multi-person 3D Pose and Shape Estimation with
Progressive Video Transformers [71.72888202522644]
We propose a new end-to-end multi-person 3D Pose and Shape estimation framework with a progressive Video Transformer.
In PSVT, a spatio-temporal encoder captures the global feature dependencies among spatial objects.
To handle the variances of objects as time proceeds, a novel scheme of progressive decoding is used.
arXiv Detail & Related papers (2023-03-16T09:55:43Z) - Multi-View Multi-Person 3D Pose Estimation with Plane Sweep Stereo [71.59494156155309]
Existing approaches for multi-view 3D pose estimation explicitly establish cross-view correspondences to group 2D pose detections from multiple camera views.
We present our multi-view 3D pose estimation approach based on plane sweep stereo to jointly address the cross-view fusion and 3D pose reconstruction in a single shot.
arXiv Detail & Related papers (2021-04-06T03:49:35Z) - 3D Human Pose Estimation with Spatial and Temporal Transformers [59.433208652418976]
We present PoseFormer, a purely transformer-based approach for 3D human pose estimation in videos.
Inspired by recent developments in vision transformers, we design a spatial-temporal transformer structure.
We quantitatively and qualitatively evaluate our method on two popular and standard benchmark datasets.
arXiv Detail & Related papers (2021-03-18T18:14:37Z)
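For the spatial-temporal transformer design mentioned in the PoseFormer entry above, a minimal sketch is given below; the token dimensions, the per-frame flattening, and the center-frame regression head are illustrative assumptions based on the one-line summary, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class SpatialTemporalPoseTransformer(nn.Module):
    """Sketch: per-frame spatial attention over joints, then temporal attention over frames."""

    def __init__(self, num_joints=17, dim=64, heads=4, spatial_layers=2, temporal_layers=2):
        super().__init__()
        self.num_joints = num_joints
        self.embed = nn.Linear(2, dim)  # lift each detected 2D keypoint to a token
        s_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        t_layer = nn.TransformerEncoderLayer(d_model=num_joints * dim, nhead=heads, batch_first=True)
        self.spatial = nn.TransformerEncoder(s_layer, spatial_layers)
        self.temporal = nn.TransformerEncoder(t_layer, temporal_layers)
        self.head = nn.Linear(num_joints * dim, num_joints * 3)

    def forward(self, poses_2d):
        # poses_2d: (B, T, J, 2) 2D keypoints for T consecutive frames
        B, T, J, _ = poses_2d.shape
        x = self.embed(poses_2d)            # (B, T, J, dim)
        x = self.spatial(x.flatten(0, 1))   # joints attend to each other within each frame
        x = x.view(B, T, -1)                # (B, T, J*dim): one token per frame
        x = self.temporal(x)                # frames attend to each other across time
        center = x[:, T // 2]               # regress the 3D pose of the center frame
        return self.head(center).view(B, self.num_joints, 3)
```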