Lifting Transformer for 3D Human Pose Estimation in Video
- URL: http://arxiv.org/abs/2103.14304v1
- Date: Fri, 26 Mar 2021 07:35:08 GMT
- Title: Lifting Transformer for 3D Human Pose Estimation in Video
- Authors: Wenhao Li, Hong Liu, Runwei Ding, Mengyuan Liu, Pichao Wang
- Abstract summary: We propose a novel Transformer-based architecture, called Lifting Transformer, for 3D human pose estimation.
A vanilla Transformer encoder (VTE) is adopted to model long-range dependencies of 2D pose sequences.
The modified VTE, termed the strided Transformer encoder (STE), is built upon the outputs of the VTE.
- Score: 27.005291611674377
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite great progress in video-based 3D human pose estimation, it is still
challenging to learn a discriminative single-pose representation from redundant
sequences. To this end, we propose a novel Transformer-based architecture,
called Lifting Transformer, for 3D human pose estimation to lift a sequence of
2D joint locations to a 3D pose. Specifically, a vanilla Transformer encoder
(VTE) is adopted to model long-range dependencies of 2D pose sequences. To
reduce redundancy of the sequence and aggregate information from local context,
fully-connected layers in the feed-forward network of VTE are replaced with
strided convolutions to progressively reduce the sequence length. The modified
VTE is termed the strided Transformer encoder (STE) and is built upon the
outputs of the VTE. The STE not only significantly reduces the computation cost but
also effectively aggregates information to a single-vector representation in a
global and local fashion. Moreover, a full-to-single supervision scheme is
employed at both the full sequence scale and single target frame scale,
applied to the outputs of the VTE and STE, respectively. This scheme imposes extra
temporal smoothness constraints in conjunction with the single target frame
supervision. The proposed architecture is evaluated on two challenging
benchmark datasets, namely, Human3.6M and HumanEva-I, and achieves
state-of-the-art results with far fewer parameters.
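
As a rough illustration of the idea described above, the sketch below (PyTorch, not the authors' released code) shows a Transformer encoder layer whose feed-forward sublayer is replaced by strided 1D convolutions, so each layer shortens the pose sequence while aggregating local context. Class names, dimensions, and the stride value are illustrative assumptions; the actual Lifting Transformer may differ in details such as normalization placement and positional encoding.

```python
# Minimal, illustrative sketch (not the authors' code) of a strided
# Transformer encoder layer: attention models long-range dependencies,
# while strided 1D convolutions replace the fully-connected feed-forward
# network and shorten the sequence at every layer.
import torch
import torch.nn as nn

class StridedEncoderLayer(nn.Module):
    def __init__(self, d_model=256, n_heads=8, d_ff=512, stride=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Strided convolutions stand in for the usual feed-forward network,
        # reducing the temporal length by `stride` per layer.
        self.ff = nn.Sequential(
            nn.Conv1d(d_model, d_ff, kernel_size=3, stride=stride, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_ff, d_model, kernel_size=1),
        )
        # Matching strided convolution on the residual path so shapes agree.
        self.down = nn.Conv1d(d_model, d_model, kernel_size=3, stride=stride, padding=1)

    def forward(self, x):                                  # x: (batch, frames, d_model)
        x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
        y = self.ff(x.transpose(1, 2)).transpose(1, 2)     # shortened sequence
        r = self.down(x.transpose(1, 2)).transpose(1, 2)   # downsampled residual
        return self.norm2(r + y)

# Toy usage: three stacked layers reduce a 27-frame sequence to a single
# vector, which would then be regressed to the 3D pose of the target frame.
frames = torch.randn(2, 27, 256)           # embedded 2D pose sequence (hypothetical dims)
stack = nn.Sequential(*[StridedEncoderLayer() for _ in range(3)])
single = stack(frames)                     # 27 -> 9 -> 3 -> 1 frames
print(single.shape)                        # torch.Size([2, 1, 256])
```

Under the full-to-single scheme described in the abstract, the full-length output of the vanilla encoder would be supervised over all frames, while the single-vector output above would be supervised only on the target frame.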
Related papers
- SGIFormer: Semantic-guided and Geometric-enhanced Interleaving Transformer for 3D Instance Segmentation [14.214197948110115]
This paper introduces a novel method, named SGIFormer, for 3D instance segmentation.
It is composed of the Semantic-guided Mix Query (SMQ) and the Geometric-enhanced Interleaving Transformer (GIT) decoder.
It attains state-of-the-art performance on ScanNet V2, ScanNet200, and the challenging high-fidelity ScanNet++ benchmark.
arXiv Detail & Related papers (2024-07-16T10:17:28Z)
- Graph and Skipped Transformer: Exploiting Spatial and Temporal Modeling Capacities for Efficient 3D Human Pose Estimation [36.93661496405653]
We take a global approach to exploit spatio-temporal information with a concise Graph and Skipped Transformer architecture.
Specifically, in the 3D pose stage, coarse-grained body parts are deployed to construct a fully data-driven adaptive model.
Experiments are conducted on the Human3.6M, MPI-INF-3DHP and HumanEva benchmarks.
arXiv Detail & Related papers (2024-07-03T10:42:09Z)
- UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues [55.69339788566899]
UPose3D is a novel approach for multi-view 3D human pose estimation.
It improves robustness and flexibility without requiring direct 3D annotations.
arXiv Detail & Related papers (2024-04-23T00:18:00Z)
- S^2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR [50.435592120607815]
Scene graph generation (SGG) of surgical procedures is crucial for enhancing holistic cognitive intelligence in the operating room (OR).
Previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes with pose estimation and object detection.
In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S2Former-OR.
arXiv Detail & Related papers (2024-02-22T11:40:49Z)
- Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers [37.14235383028582]
We introduce a novel approach for single-view reconstruction that efficiently generates a 3D model from a single image via feed-forward inference.
Our method utilizes two transformer-based networks, namely a point decoder and a triplane decoder, to reconstruct 3D objects using a hybrid Triplane-Gaussian intermediate representation.
arXiv Detail & Related papers (2023-12-14T17:18:34Z)
- Global-to-Local Modeling for Video-based 3D Human Pose and Shape Estimation [53.04781510348416]
Video-based 3D human pose and shape estimation is evaluated by intra-frame accuracy and inter-frame smoothness.
We propose to structurally decouple the modeling of long-term and short-term correlations in an end-to-end framework, the Global-to-Local Transformer (GLoT).
Our GLoT surpasses previous state-of-the-art methods with the lowest model parameters on popular benchmarks, i.e., 3DPW, MPI-INF-3DHP, and Human3.6M.
arXiv Detail & Related papers (2023-03-26T14:57:49Z)
- PSVT: End-to-End Multi-person 3D Pose and Shape Estimation with Progressive Video Transformers [71.72888202522644]
We propose PSVT, a new end-to-end multi-person 3D pose and shape estimation framework with progressive Video Transformers.
In PSVT, a spatio-temporal encoder (PGA) captures the global feature dependencies among spatial objects.
To handle the variances of objects as time proceeds, a novel scheme of progressive decoding is used.
arXiv Detail & Related papers (2023-03-16T09:55:43Z)
- IVT: An End-to-End Instance-guided Video Transformer for 3D Pose Estimation [6.270047084514142]
Video 3D human pose estimation aims to localize the 3D coordinates of human joints from videos.
IVT enables learning spatio-temporal contextual depth information from visual features and predicting 3D poses directly from video frames.
Experiments on three widely-used 3D pose estimation benchmarks show that the proposed IVT achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-08-06T02:36:33Z)
- VTP: Volumetric Transformer for Multi-view Multi-person 3D Pose Estimation [4.603321798937854]
Volumetric Transformer Pose estimator (VTP) is the first 3D transformer framework for multi-view multi-person 3D human pose estimation.
VTP aggregates features from 2D keypoints in all camera views and learns the relationships in the 3D voxel space in an end-to-end fashion.
arXiv Detail & Related papers (2022-05-25T09:26:42Z)
- Multi-initialization Optimization Network for Accurate 3D Human Pose and Shape Estimation [75.44912541912252]
We propose a three-stage framework named the Multi-Initialization Optimization Network (MION).
In the first stage, we strategically select different coarse 3D reconstruction candidates that are compatible with the 2D keypoints of the input sample.
In the second stage, we design a mesh refinement transformer (MRT) to refine each coarse reconstruction result via a self-attention mechanism.
Finally, a Consistency Estimation Network (CEN) is proposed to find the best result from multiple candidates by evaluating whether the visual evidence in the RGB image matches a given 3D reconstruction.
arXiv Detail & Related papers (2021-12-24T02:43:58Z)
- Reinforced Axial Refinement Network for Monocular 3D Object Detection [160.34246529816085]
Monocular 3D object detection aims to extract the 3D position and properties of objects from a 2D input image.
Conventional approaches sample 3D bounding boxes from the space and infer the relationship between the target object and each of them; however, the probability of effective samples is relatively small in 3D space.
We propose to start with an initial prediction and refine it gradually towards the ground truth, with only one 3D parameter changed in each step.
This requires designing a policy which gets a reward after several steps, and thus we adopt reinforcement learning to optimize it.
arXiv Detail & Related papers (2020-08-31T17:10:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.