Deep Two-Stream Video Inference for Human Body Pose and Shape Estimation
- URL: http://arxiv.org/abs/2110.11680v1
- Date: Fri, 22 Oct 2021 10:01:13 GMT
- Title: Deep Two-Stream Video Inference for Human Body Pose and Shape Estimation
- Authors: Ziwen Li, Bo Xu, Han Huang, Cheng Lu and Yandong Guo
- Abstract summary: We propose a new framework to generate 3D human pose and mesh from RGB videos.
We train a two-stream temporal network based on a transformer to predict SMPL parameters.
The proposed algorithm is extensively evaluated on the Human3.6M and 3DPW datasets.
- Score: 18.14237514372724
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Several video-based 3D pose and shape estimation algorithms have been
proposed to resolve the temporal inconsistency of single-image-based methods.
However, it remains challenging to achieve stable and accurate
reconstruction. In this paper, we propose a new framework, Deep Two-Stream Video
Inference for Human Body Pose and Shape Estimation (DTS-VIBE), to generate 3D
human pose and mesh from RGB videos. We reformulate the task as a
multi-modality problem that fuses RGB and optical flow for more reliable
estimation. To fully utilize both sensory modalities (RGB and optical
flow), we train a two-stream temporal network based on a transformer to predict
SMPL parameters. The supplementary modality, optical flow, helps to maintain
temporal consistency by leveraging motion knowledge between two consecutive
frames. The proposed algorithm is extensively evaluated on the Human3.6M and
3DPW datasets. The experimental results show that it outperforms other
state-of-the-art methods by a significant margin.
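The abstract gives no architectural details beyond "two-stream", "temporal", and "transformer", so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes per-frame RGB and optical-flow features are already extracted and aligned to the same number of frames, passes each stream through one single-head self-attention layer over time, concatenates the two streams, and regresses the 72 pose and 10 shape values of SMPL with a linear head. The function names, the concatenation-based fusion, the feature dimension, and the random weights are all illustrative assumptions.

```python
import numpy as np

SMPL_POSE_DIM = 72   # 24 joints x 3 axis-angle values
SMPL_SHAPE_DIM = 10  # shape coefficients (betas)


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over time; x has shape (T, D)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (T, T) frame-to-frame weights
    return softmax(scores, axis=-1) @ v       # (T, D)


def init_params(dim, rng):
    """Random weights for illustration only (hypothetical, untrained)."""
    def attn():
        return [rng.standard_normal((dim, dim)) * 0.1 for _ in range(3)]
    out_dim = SMPL_POSE_DIM + SMPL_SHAPE_DIM
    return {
        "rgb_attn": attn(),
        "flow_attn": attn(),
        "head": rng.standard_normal((2 * dim, out_dim)) * 0.1,
    }


def two_stream_predict(rgb_feats, flow_feats, params):
    """Fuse two temporally encoded streams and regress per-frame SMPL values."""
    rgb = self_attention(rgb_feats, *params["rgb_attn"])
    flow = self_attention(flow_feats, *params["flow_attn"])
    fused = np.concatenate([rgb, flow], axis=-1)   # assumed fusion: concatenation
    out = fused @ params["head"]                   # (T, 82)
    return out[:, :SMPL_POSE_DIM], out[:, SMPL_POSE_DIM:]


rng = np.random.default_rng(0)
params = init_params(16, rng)
rgb_feats = rng.standard_normal((8, 16))    # 8 frames of RGB features
flow_feats = rng.standard_normal((8, 16))   # 8 frames of flow features (assumed aligned)
pose, shape = two_stream_predict(rgb_feats, flow_feats, params)
print(pose.shape, shape.shape)
```

Optical flow is computed between consecutive frames, so a real pipeline would need to decide how to align T-1 flow fields with T RGB frames; the sketch sidesteps this by assuming equal lengths.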
Related papers
- MVD-Fusion: Single-view 3D via Depth-consistent Multi-view Generation [54.27399121779011]
We present MVD-Fusion: a method for single-view 3D inference via generative modeling of multi-view-consistent RGB-D images.
We show that our approach can yield more accurate synthesis compared to recent state-of-the-art, including distillation-based 3D inference and prior multi-view generation methods.
arXiv Detail & Related papers (2024-04-04T17:59:57Z)
- Co-Evolution of Pose and Mesh for 3D Human Body Estimation from Video [23.93644678238666]
We propose a Pose and Mesh Co-Evolution network (PMCE) to recover 3D human motion from a video.
The proposed PMCE outperforms previous state-of-the-art methods in terms of both per-frame accuracy and temporal consistency.
arXiv Detail & Related papers (2023-08-20T16:03:21Z)
- Unfolding Framework with Prior of Convolution-Transformer Mixture and Uncertainty Estimation for Video Snapshot Compressive Imaging [7.601695814245209]
We consider the problem of video snapshot compressive imaging (SCI), where sequential high-speed frames are modulated by different masks and captured by a single measurement.
By combining optimization algorithms and neural networks, deep unfolding networks (DUNs) score tremendous achievements in solving inverse problems.
arXiv Detail & Related papers (2023-06-20T06:25:48Z)
- TAPE: Temporal Attention-based Probabilistic human pose and shape Estimation [7.22614468437919]
Existing methods ignore the ambiguities of the reconstruction and provide a single deterministic estimate for the 3D pose.
We present a Temporal Attention based Probabilistic human pose and shape Estimation method (TAPE) that operates on an RGB video.
We show that TAPE outperforms state-of-the-art methods in standard benchmarks.
arXiv Detail & Related papers (2023-04-29T06:08:43Z)
- Shakes on a Plane: Unsupervised Depth Estimation from Unstabilized Photography [54.36608424943729]
We show that in a "long-burst", forty-two 12-megapixel RAW frames captured in a two-second sequence, there is enough parallax information from natural hand tremor alone to recover high-quality scene depth.
We devise a test-time optimization approach that fits a neural RGB-D representation to long-burst data and simultaneously estimates scene depth and camera motion.
arXiv Detail & Related papers (2022-12-22T18:54:34Z)
- Self-Attentive 3D Human Pose and Shape Estimation from Videos [82.63503361008607]
We present a video-based learning algorithm for 3D human pose and shape estimation.
We exploit temporal information in videos and propose a self-attention module.
We evaluate our method on the 3DPW, MPI-INF-3DHP, and Human3.6M datasets.
arXiv Detail & Related papers (2021-03-26T00:02:19Z)
- Multi-view Depth Estimation using Epipolar Spatio-Temporal Networks [87.50632573601283]
We present a novel method for multi-view depth estimation from a single video.
Our method achieves temporally coherent depth estimation results by using a novel Epipolar Spatio-Temporal (EST) transformer.
To reduce the computational cost, inspired by recent Mixture-of-Experts models, we design a compact hybrid network.
arXiv Detail & Related papers (2020-11-26T04:04:21Z)
- Synthetic Training for Monocular Human Mesh Recovery [100.38109761268639]
This paper aims to estimate 3D mesh of multiple body parts with large-scale differences from a single RGB image.
The main challenge is lacking training data that have complete 3D annotations of all body parts in 2D images.
We propose a depth-to-scale (D2S) projection to incorporate the depth difference into the projection function to derive per-joint scale variants.
arXiv Detail & Related papers (2020-10-27T03:31:35Z)
- A Single Stream Network for Robust and Real-time RGB-D Salient Object Detection [89.88222217065858]
We design a single stream network to use the depth map to guide early fusion and middle fusion between RGB and depth.
This model is 55.5% lighter than the current lightest model and runs at a real-time speed of 32 FPS when processing a $384 \times 384$ image.
arXiv Detail & Related papers (2020-07-14T04:40:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.