Adaptively Multi-view and Temporal Fusing Transformer for 3D Human Pose Estimation
- URL: http://arxiv.org/abs/2110.05092v1
- Date: Mon, 11 Oct 2021 08:57:43 GMT
- Title: Adaptively Multi-view and Temporal Fusing Transformer for 3D Human Pose Estimation
- Authors: Hui Shuai, Lele Wu, and Qingshan Liu
- Abstract summary: 3D Human Pose Estimation (HPE) faces several variable factors, including the number of views, the length of the video sequence, and whether camera calibration is available.
We propose a unified framework named Multi-view and Temporal Fusing Transformer (MTF-Transformer) to adaptively handle varying view numbers and video lengths without calibration.
- Score: 10.625664582408687
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In practical applications, 3D Human Pose Estimation (HPE) faces
several variable factors, including the number of views, the length of the
video sequence, and whether camera calibration is available. To this end, we
propose a unified framework named Multi-view and Temporal Fusing Transformer
(MTF-Transformer) to adaptively handle varying view numbers and video lengths
without calibration. MTF-Transformer consists of Feature Extractor, Multi-view
Fusing Transformer (MFT), and Temporal Fusing Transformer (TFT). Feature
Extractor estimates the 2D pose from each image and encodes the predicted
coordinates and confidence into feature embedding for further 3D pose
inference. It discards the image features and focuses on lifting the 2D pose
into the 3D pose, making the subsequent modules computationally lightweight
enough to handle videos. MFT fuses the features of a varying number of views
with a relative-attention block. It adaptively measures the implicit
relationship between each pair of views and reconstructs the features. TFT
aggregates the features of the whole sequence and predicts 3D pose via a
transformer, which is adaptive to the length of the video and takes full
advantage of the temporal information. With these modules, MTF-Transformer
handles different application scenarios, ranging from a single monocular image
to multi-view video, without requiring camera calibration. We demonstrate
quantitative and qualitative results on the Human3.6M, TotalCapture, and KTH
Multiview Football II datasets. Experiments show that MTF-Transformer not only
obtains results comparable to state-of-the-art methods that use camera
parameters but also generalizes well to dynamic capture with an arbitrary
number of unseen views. Code is available at
https://github.com/lelexx/MTF-Transformer.
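To make the three-module design concrete, below is a minimal PyTorch-style
sketch of the pipeline the abstract describes. All names, dimensions, and
design details here (the MLP encoder, plain multi-head attention standing in
for the paper's relative-attention block, mean-pooling over views,
center-frame prediction) are illustrative assumptions, not the authors'
implementation; the real code is in the linked repository.

# Minimal sketch of an MTF-Transformer-style pipeline (assumptions, not the
# authors' code): 2D poses are lifted to 3D, so no image features are kept.
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Encodes per-view 2D coordinates plus confidence into an embedding."""
    def __init__(self, num_joints=17, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_joints * 3, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, pose2d):                    # (B, T, V, J, 3)
        B, T, V, J, C = pose2d.shape
        return self.mlp(pose2d.reshape(B, T, V, J * C))   # (B, T, V, dim)

class MultiViewFusingTransformer(nn.Module):
    """Fuses a varying number of views; standard multi-head attention stands
    in here for the paper's relative-attention block."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats):                     # (B, T, V, dim)
        B, T, V, D = feats.shape
        x = feats.reshape(B * T, V, D)            # views become tokens
        fused, _ = self.attn(x, x, x)             # pairwise view relations
        x = self.norm(x + fused)
        return x.mean(dim=1).reshape(B, T, D)     # aggregate over views

class TemporalFusingTransformer(nn.Module):
    """Attends over the whole sequence; predicts the center-frame 3D pose."""
    def __init__(self, dim=256, heads=4, layers=2, num_joints=17):
        super().__init__()
        enc = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, layers)
        self.head = nn.Linear(dim, num_joints * 3)
        self.num_joints = num_joints

    def forward(self, feats):                     # (B, T, dim)
        h = self.encoder(feats)                   # any sequence length T
        center = h[:, h.shape[1] // 2]            # center-frame feature
        return self.head(center).reshape(-1, self.num_joints, 3)

class MTFTransformer(nn.Module):
    def __init__(self, num_joints=17, dim=256):
        super().__init__()
        self.extract = FeatureExtractor(num_joints, dim)
        self.mft = MultiViewFusingTransformer(dim)
        self.tft = TemporalFusingTransformer(dim, num_joints=num_joints)

    def forward(self, pose2d):                    # any T (frames), V (views)
        return self.tft(self.mft(self.extract(pose2d)))

poses = torch.rand(2, 9, 4, 17, 3)                # 2 clips, 9 frames, 4 views
print(MTFTransformer()(poses).shape)              # torch.Size([2, 17, 3])

Because views and frames enter as ordinary attention tokens, the same weights
accept any number of views and any sequence length, which is the adaptivity
the abstract emphasizes; camera calibration never enters the computation.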
Related papers
- Human Mesh Recovery from Arbitrary Multi-view Images [57.969696744428475]
We propose a divide-and-conquer framework for Unified Human Mesh Recovery (U-HMR) from arbitrary multi-view images.
In particular, U-HMR adopts a decoupled structure with three main components: camera and body decoupling (CBD), camera pose estimation (CPE), and arbitrary view fusion (AVF).
We conduct extensive experiments on three public datasets: Human3.6M, MPI-INF-3DHP, and TotalCapture.
arXiv Detail & Related papers (2024-03-19T04:47:56Z)
- PF-LRM: Pose-Free Large Reconstruction Model for Joint Pose and Shape Prediction [77.89935657608926]
We propose a Pose-Free Large Reconstruction Model (PF-LRM) for reconstructing a 3D object from a few unposed images.
PF-LRM jointly reconstructs the object and estimates the relative camera poses in about 1.3 seconds on a single A100 GPU.
arXiv Detail & Related papers (2023-11-20T18:57:55Z)
- MVTN: Learning Multi-View Transformations for 3D Understanding [60.15214023270087]
We introduce the Multi-View Transformation Network (MVTN), which uses differentiable rendering to determine optimal viewpoints for 3D shape recognition.
MVTN can be trained end-to-end with any multi-view network for 3D shape recognition.
Our approach demonstrates state-of-the-art performance in 3D classification and shape retrieval on several benchmarks.
arXiv Detail & Related papers (2022-12-27T12:09:16Z)
- Uplift and Upsample: Efficient 3D Human Pose Estimation with Uplifting Transformers [28.586258731448687]
We present a Transformer-based pose uplifting scheme that can operate on temporally sparse 2D pose sequences.
We show how masked token modeling can be utilized for temporal upsampling within Transformer blocks.
We evaluate our method on two popular benchmark datasets: Human3.6M and MPI-INF-3DHP.
arXiv Detail & Related papers (2022-10-12T12:00:56Z)
- Pix4Point: Image Pretrained Standard Transformers for 3D Point Cloud Understanding [62.502694656615496]
We present Progressive Point Patch Embedding and a new point cloud Transformer model, PViT.
PViT shares the same backbone as the standard Transformer but is shown to be less data-hungry, achieving performance comparable to the state of the art.
We formulate a simple yet effective pipeline dubbed "Pix4Point" that allows harnessing Transformers pretrained in the image domain to enhance downstream point cloud understanding.
arXiv Detail & Related papers (2022-08-25T17:59:29Z)
- VTP: Volumetric Transformer for Multi-view Multi-person 3D Pose Estimation [4.603321798937854]
Volumetric Transformer Pose estimator (VTP) is the first 3D transformer framework for multi-view multi-person 3D human pose estimation.
VTP aggregates features from 2D keypoints in all camera views and learns the relationships in the 3D voxel space in an end-to-end fashion.
arXiv Detail & Related papers (2022-05-25T09:26:42Z)
- Video Transformers: A Survey [42.314208650554264]
We study the contributions and trends for adapting Transformers to model video data.
Specifically, we delve into how videos are embedded and tokenized, finding a very widespread use of large CNN backbones.
Also, we analyse the self-supervised losses used to train Video Transformers, which to date are mostly constrained to contrastive approaches.
arXiv Detail & Related papers (2022-01-16T07:31:55Z)
- TransFusion: Cross-view Fusion with Transformer for 3D Human Pose Estimation [21.37032015978738]
We introduce a transformer framework for multi-view 3D pose estimation.
Inspired by previous multi-modal transformers, we design a unified transformer architecture, named TransFusion.
We propose the concept of an epipolar field to encode 3D positional information into the transformer model.
arXiv Detail & Related papers (2021-10-18T18:08:18Z)
- FLEX: Parameter-free Multi-view 3D Human Motion Reconstruction [70.09086274139504]
Multi-view algorithms strongly depend on camera parameters, in particular, the relative positions among the cameras.
We introduce FLEX, an end-to-end parameter-free multi-view model.
We demonstrate results on the Human3.6M and KTH Multi-view Football II datasets.
arXiv Detail & Related papers (2021-05-05T09:08:12Z)
- 3D Human Pose Estimation with Spatial and Temporal Transformers [59.433208652418976]
We present PoseFormer, a purely transformer-based approach for 3D human pose estimation in videos.
Inspired by recent developments in vision transformers, we design a spatial-temporal transformer structure.
We quantitatively and qualitatively evaluate our method on two popular and standard benchmark datasets.
arXiv Detail & Related papers (2021-03-18T18:14:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.