VTP: Volumetric Transformer for Multi-view Multi-person 3D Pose
Estimation
- URL: http://arxiv.org/abs/2205.12602v1
- Date: Wed, 25 May 2022 09:26:42 GMT
- Title: VTP: Volumetric Transformer for Multi-view Multi-person 3D Pose
Estimation
- Authors: Yuxing Chen, Renshu Gu, Ouhan Huang and Gangyong Jia
- Abstract summary: Volumetric Transformer Pose estimator (VTP) is the first 3D transformer framework for multi-view multi-person 3D human pose estimation.
VTP aggregates features from 2D keypoints in all camera views and learns the relationships in the 3D voxel space in an end-to-end fashion.
- Score: 4.603321798937854
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents Volumetric Transformer Pose estimator (VTP), the first 3D
volumetric transformer framework for multi-view multi-person 3D human pose
estimation. VTP aggregates features from 2D keypoints in all camera views and
directly learns the spatial relationships in the 3D voxel space in an
end-to-end fashion. The aggregated 3D features are passed through 3D
convolutions before being flattened into sequential embeddings and fed into a
transformer. A residual structure is designed to further improve performance.
In addition, sparse Sinkhorn attention is employed to reduce the memory cost, a
major bottleneck for volumetric representations, while still achieving
excellent performance. The output of the transformer is then concatenated with
the 3D convolutional features through a residual design. The proposed VTP
framework combines the strong performance of transformers with volumetric
representations, offering an effective alternative to convolutional backbones.
Experiments on the Shelf, Campus and CMU Panoptic benchmarks show promising
results in terms of both Mean Per Joint Position Error (MPJPE) and Percentage
of Correctly estimated Parts (PCP). Our code will be made available.
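
To make the pipeline concrete, the following is a minimal PyTorch sketch of the stages the abstract describes: 3D convolutions over the aggregated voxel features, flattening into a token sequence for a transformer, and residual fusion of the two feature streams. It is an illustration under stated assumptions, not the authors' released code: all module names and layer sizes are hypothetical, the aggregation of 2D keypoint features into the voxel volume is assumed to happen upstream, a plain dense transformer encoder stands in for the paper's sparse Sinkhorn attention, and the residual fusion uses addition where the paper concatenates.

```python
import torch
import torch.nn as nn

class VTPSketch(nn.Module):
    """Hypothetical sketch of the VTP pipeline; not the authors' code."""

    def __init__(self, in_ch=32, dim=128, depth=4, heads=8, n_joints=15):
        super().__init__()
        # 3D convolutional stem over the aggregated voxel features
        self.conv3d = nn.Sequential(
            nn.Conv3d(in_ch, dim, kernel_size=3, padding=1),
            nn.BatchNorm3d(dim),
            nn.ReLU(inplace=True),
        )
        # dense attention as a stand-in for the paper's sparse Sinkhorn attention
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        # per-voxel head predicting one likelihood volume per joint
        self.head = nn.Conv3d(dim, n_joints, kernel_size=1)

    def forward(self, voxels):                     # (B, C, X, Y, Z)
        feats = self.conv3d(voxels)                # (B, D, X, Y, Z)
        B, D, X, Y, Z = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)  # (B, X*Y*Z, D) sequence
        out = self.transformer(tokens)             # voxel-to-voxel relationships
        out = out.transpose(1, 2).reshape(B, D, X, Y, Z)
        fused = feats + out                        # residual fusion (paper concatenates)
        return self.head(fused)                    # (B, J, X, Y, Z) joint heatmaps

# toy usage on an 8x8x8 voxel grid
heatmaps = VTPSketch()(torch.randn(1, 32, 8, 8, 8))
```

The memory claim can be sketched as well. The block below is a simplified take on sparse Sinkhorn attention (Tay et al., 2020), which the abstract adopts against the quadratic cost of attending over all voxels: tokens are grouped into blocks, a Sinkhorn-normalized score matrix softly matches key/value blocks to query blocks, and attention is computed only within matched blocks. Matching blocks by mean-embedding similarity is an assumption made here for brevity; the original method learns a dedicated sorting network.

```python
import torch
import torch.nn as nn

def sinkhorn(logits, iters=5):
    # Log-domain Sinkhorn normalization: alternately normalize rows and
    # columns so that exp(logits) approaches a doubly stochastic matrix.
    for _ in range(iters):
        logits = logits - torch.logsumexp(logits, dim=-1, keepdim=True)
        logits = logits - torch.logsumexp(logits, dim=-2, keepdim=True)
    return logits.exp()

class SinkhornBlockAttention(nn.Module):
    """Simplified sparse Sinkhorn attention; an illustration, not VTP's code."""

    def __init__(self, dim, block_size):
        super().__init__()
        self.block_size = block_size
        self.to_qkv = nn.Linear(dim, dim * 3)

    def forward(self, x):                  # (B, N, D), N divisible by block_size
        B, N, D = x.shape
        nb = N // self.block_size
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        q, k, v = (t.reshape(B, nb, self.block_size, D) for t in (q, k, v))
        # score block pairs from mean key embeddings, then soft-sort via Sinkhorn
        means = k.mean(dim=2)                                   # (B, nb, D)
        perm = sinkhorn(means @ means.transpose(1, 2))          # (B, nb, nb)
        k = torch.einsum('bij,bjtd->bitd', perm, k)             # route k/v blocks
        v = torch.einsum('bij,bjtd->bitd', perm, v)
        # each block attends only within its matched block: O(N * block) memory
        attn = (q @ k.transpose(-2, -1)) / D ** 0.5             # (B, nb, t, t)
        return (attn.softmax(dim=-1) @ v).reshape(B, N, D)

out = SinkhornBlockAttention(64, block_size=16)(torch.randn(2, 512, 64))
```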
Related papers
- SMPLer: Taming Transformers for Monocular 3D Human Shape and Pose Estimation [74.07836010698801]
We propose an SMPL-based Transformer framework (SMPLer) to address the quadratic computation and memory complexity of attention in monocular 3D human shape and pose estimation.
SMPLer incorporates two key ingredients: a decoupled attention operation and an SMPL-based target representation.
Extensive experiments demonstrate the effectiveness of SMPLer against existing 3D human shape and pose estimation methods.
arXiv Detail & Related papers (2024-04-23T17:59:59Z)
- Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation [73.31524865643709]
We present a plug-and-play pruning-and-recovering framework, called Hourglass Tokenizer (HoT), for efficient transformer-based 3D pose estimation from videos.
Our HoT begins with pruning the pose tokens of redundant frames and ends with recovering the full-length tokens, resulting in a few pose tokens in the intermediate transformer blocks.
Our method can achieve both high efficiency and estimation accuracy compared to the original VPT models.
arXiv Detail & Related papers (2023-11-20T18:59:51Z)
- UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation [113.35352122662752]
We present an efficient multi-modal backbone for outdoor 3D perception named UniTR.
UniTR processes a variety of modalities with unified modeling and shared parameters.
UniTR is also a fundamentally task-agnostic backbone that naturally supports different 3D perception tasks.
arXiv Detail & Related papers (2023-08-15T12:13:44Z)
- PSVT: End-to-End Multi-person 3D Pose and Shape Estimation with Progressive Video Transformers [71.72888202522644]
We propose a new end-to-end multi-person 3D pose and shape estimation framework with progressive video transformers.
In PSVT, a spatio-temporal encoder captures the global feature dependencies among spatial objects.
To handle the variances of objects as time proceeds, a novel scheme of progressive decoding is used.
arXiv Detail & Related papers (2023-03-16T09:55:43Z)
- DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets [95.84755169585492]
We present Dynamic Sparse Voxel Transformer (DSVT), a single-stride window-based voxel Transformer backbone for outdoor 3D perception.
Our model achieves state-of-the-art performance with a broad range of 3D perception tasks.
arXiv Detail & Related papers (2023-01-15T09:31:58Z)
- 3D Vision with Transformers: A Survey [114.86385193388439]
The success of the transformer architecture in natural language processing has sparked great interest in the computer vision field.
We present a systematic and thorough review of more than 100 transformer-based methods for different 3D vision tasks.
We discuss transformer design in 3D vision, which allows it to process data with various 3D representations.
arXiv Detail & Related papers (2022-08-08T17:59:11Z)
- AFTer-UNet: Axial Fusion Transformer UNet for Medical Image Segmentation [19.53151547706724]
Recent transformer-based models have drawn attention to exploring these techniques in medical image segmentation.
We propose Axial Fusion Transformer UNet (AFTer-UNet), which takes both advantages of convolutional layers' capability of extracting detailed features and transformers' strength on long sequence modeling.
It has fewer parameters and takes less GPU memory to train than the previous transformer-based models.
arXiv Detail & Related papers (2021-10-20T06:47:28Z)
- TransFusion: Cross-view Fusion with Transformer for 3D Human Pose Estimation [21.37032015978738]
We introduce a transformer framework for multi-view 3D pose estimation.
Inspired by previous multi-modal transformers, we design a unified transformer architecture, named TransFusion.
We propose the concept of epipolar field to encode 3D positional information into the transformer model.
arXiv Detail & Related papers (2021-10-18T18:08:18Z)
- Lifting Transformer for 3D Human Pose Estimation in Video [27.005291611674377]
We propose a novel Transformer-based architecture, called Lifting Transformer, for 3D human pose estimation.
A vanilla Transformer encoder (VTE) is adopted to model long-range dependencies of 2D pose sequences.
A modified VTE, termed the strided Transformer encoder (STE), is built upon the outputs of the VTE.
arXiv Detail & Related papers (2021-03-26T07:35:08Z)