PoseBERT: A Generic Transformer Module for Temporal 3D Human Modeling
- URL: http://arxiv.org/abs/2208.10211v1
- Date: Mon, 22 Aug 2022 11:30:14 GMT
- Title: PoseBERT: A Generic Transformer Module for Temporal 3D Human Modeling
- Authors: Fabien Baradel, Romain Brégier, Thibault Groueix, Philippe
Weinzaepfel, Yannis Kalantidis, Grégory Rogez
- Abstract summary: PoseBERT is a transformer module that is fully trained on 3D Motion Capture data via masked modeling.
It is simple, generic and versatile, as it can be plugged on top of any image-based model to transform it into a video-based model.
Our experimental results validate that adding PoseBERT on top of various state-of-the-art pose estimation methods consistently improves their performance.
- Score: 23.420076136028687
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training state-of-the-art models for human pose estimation in
videos requires datasets with annotations that are hard and expensive to obtain.
Although transformers have been recently utilized for body pose sequence
modeling, related methods rely on pseudo-ground truth to augment the currently
limited training data available for learning such models. In this paper, we
introduce PoseBERT, a transformer module that is fully trained on 3D Motion
Capture (MoCap) data via masked modeling. It is simple, generic and versatile,
as it can be plugged on top of any image-based model to transform it into a
video-based model leveraging temporal information. We showcase variants of
PoseBERT with different inputs varying from 3D skeleton keypoints to rotations
of a 3D parametric model for either the full body (SMPL) or just the hands
(MANO). Since PoseBERT training is task agnostic, the model can be applied to
several tasks such as pose refinement, future pose prediction or motion
completion without finetuning. Our experimental results validate that adding
PoseBERT on top of various state-of-the-art pose estimation methods
consistently improves their performance, while its low computational cost
allows us to use it in a real-time demo for smoothly animating a robotic hand
via a webcam. Test code and models are available at
https://github.com/naver/posebert.
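To make the masked-modeling idea concrete, below is a minimal, hypothetical PyTorch sketch of training a transformer on MoCap pose sequences by reconstructing randomly masked frames. All names, dimensions and hyperparameters (PoseDenoiser, the 24-joint 6D-rotation input, the 30% masking ratio) are illustrative assumptions, not the actual code from https://github.com/naver/posebert.

```python
# Minimal sketch of BERT-style masked modeling on pose sequences.
# All names and sizes are illustrative, not the naver/posebert API.
import torch
import torch.nn as nn

D_POSE = 24 * 6   # assumption: SMPL's 24 joints in a 6D rotation representation
D_MODEL = 256
SEQ_LEN = 16

class PoseDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(D_POSE, D_MODEL)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, D_MODEL))
        self.pos = nn.Parameter(torch.zeros(1, SEQ_LEN, D_MODEL))
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(D_MODEL, D_POSE)

    def forward(self, poses, mask):
        # poses: (B, T, D_POSE) per-frame pose parameters.
        # mask:  (B, T) bool, True where a frame is hidden and must be filled in.
        x = self.embed(poses)
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        x = self.encoder(x + self.pos)
        return self.head(x)  # predicts the full sequence

# One training step on MoCap sequences (no images involved):
model = PoseDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
gt = torch.randn(8, SEQ_LEN, D_POSE)     # stand-in for real MoCap poses
mask = torch.rand(8, SEQ_LEN) < 0.3      # hide ~30% of the frames
opt.zero_grad()
pred = model(gt, mask)
loss = (pred - gt)[mask].abs().mean()    # supervise only the masked frames
loss.backward()
opt.step()
```

Because this training objective is task agnostic, the masking pattern at inference can be matched to the task the abstract lists: masking trailing frames yields future pose prediction, masking a middle span yields motion completion, and feeding noisy per-frame estimates with no mask could act as pose refinement.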
Related papers
- HeadCraft: Modeling High-Detail Shape Variations for Animated 3DMMs [9.790185628415301]
We introduce a generative model for detailed 3D head meshes on top of an articulated 3DMM.
We train a StyleGAN model to generalize over UV displacement maps.
We demonstrate results for unconditional generation and for fitting to full or partial observations.
arXiv Detail & Related papers (2023-12-21T18:57:52Z) - Ponymation: Learning 3D Animal Motions from Unlabeled Online Videos [50.83155160955368]
We introduce a new method for learning a generative model of articulated 3D animal motions from raw, unlabeled online videos.
Our model does not require any pose annotations or shape models for training, and is learned purely from a collection of raw video clips obtained from the Internet.
arXiv Detail & Related papers (2023-12-21T06:44:18Z) - Uplift and Upsample: Efficient 3D Human Pose Estimation with Uplifting
Transformers [28.586258731448687]
We present a Transformer-based pose uplifting scheme that can operate on temporally sparse 2D pose sequences.
We show how masked token modeling can be utilized for temporal upsampling within Transformer blocks (a sketch of this idea appears after this list).
We evaluate our method on two popular benchmark datasets: Human3.6M and MPI-INF-3DHP.
arXiv Detail & Related papers (2022-10-12T12:00:56Z) - T3VIP: Transformation-based 3D Video Prediction [49.178585201673364]
We propose a 3D video prediction (T3VIP) approach that explicitly models the 3D motion by decomposing a scene into its object parts.
Our model is fully unsupervised, captures the nature of the real world, and learns from observational cues in the image and point-cloud domains.
To the best of our knowledge, our model is the first generative model that provides an RGB-D video prediction of the future for a static camera.
arXiv Detail & Related papers (2022-09-19T15:01:09Z) - BANMo: Building Animatable 3D Neural Models from Many Casual Videos [135.64291166057373]
We present BANMo, a method that requires neither a specialized sensor nor a pre-defined template shape.
BANMo builds high-fidelity, articulated 3D models from many monocular casual videos in a differentiable rendering framework.
On real and synthetic datasets, BANMo shows higher-fidelity 3D reconstructions than prior works for humans and animals.
arXiv Detail & Related papers (2021-12-23T18:30:31Z) - Human Performance Capture from Monocular Video in the Wild [50.34917313325813]
We propose a method capable of capturing the dynamic 3D human shape from a monocular video featuring challenging body poses.
Our method outperforms state-of-the-art methods on the in-the-wild human video dataset 3DPW.
arXiv Detail & Related papers (2021-11-29T16:32:41Z) - Leveraging MoCap Data for Human Mesh Recovery [27.76352018682937]
We study whether poses from 3D Motion Capture (MoCap) data can be used to improve image-based and video-based human mesh recovery methods.
We find that fine-tuning image-based models with synthetic renderings from MoCap data can increase their performance.
We introduce PoseBERT, a transformer module that directly regresses the pose parameters and is trained via masked modeling.
arXiv Detail & Related papers (2021-10-18T12:43:00Z) - Vid2Actor: Free-viewpoint Animatable Person Synthesis from Video in the
Wild [22.881898195409885]
Given an "in-the-wild" video of a person, we reconstruct an animatable model of the person in the video.
The output model can be rendered in any body pose to any camera view, via the learned controls, without explicit 3D mesh reconstruction.
arXiv Detail & Related papers (2020-12-23T18:50:42Z) - Self-Supervised 3D Human Pose Estimation via Part Guided Novel Image
Synthesis [72.34794624243281]
We propose a self-supervised learning framework to disentangle variations from unlabeled video frames.
Our differentiable formalization, bridging the representation gap between the 3D pose and spatial part maps, allows us to operate on videos with diverse camera movements.
arXiv Detail & Related papers (2020-04-09T07:55:01Z) - Monocular Real-time Hand Shape and Motion Capture using Multi-modal Data [77.34069717612493]
We present a novel method for monocular hand shape and pose estimation at an unprecedented runtime performance of 100 fps.
This is enabled by a new learning based architecture designed such that it can make use of all the sources of available hand training data.
It features a 3D hand joint detection module and an inverse kinematics module which regresses not only 3D joint positions but also maps them to joint rotations in a single feed-forward pass.
arXiv Detail & Related papers (2020-03-21T03:51:54Z)
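As referenced in the "Uplift and Upsample" entry above, here is a hedged sketch of masked-token temporal upsampling: 2D poses observed only at every STRIDE-th frame are embedded, learned mask tokens fill the remaining timesteps, and a transformer predicts a dense 3D sequence. The class name, joint count and tensor shapes are assumptions based solely on the abstract, not that paper's implementation.

```python
# Hedged sketch of masked-token temporal upsampling; names/shapes assumed.
import torch
import torch.nn as nn

D_2D, D_3D = 17 * 2, 17 * 3      # assumption: 17 joints, 2D in / 3D out
D_MODEL, T_DENSE, STRIDE = 256, 32, 4

class Upsampler(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(D_2D, D_MODEL)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, D_MODEL))
        self.pos = nn.Parameter(torch.zeros(1, T_DENSE, D_MODEL))
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(D_MODEL, D_3D)

    def forward(self, sparse_2d):
        # sparse_2d: (B, T_DENSE // STRIDE, D_2D), 2D poses at every STRIDE-th frame.
        b = sparse_2d.shape[0]
        tokens = self.mask_token.expand(b, T_DENSE, -1).clone()
        tokens[:, ::STRIDE] = self.embed(sparse_2d)  # keyframes get real embeddings
        out = self.encoder(tokens + self.pos)
        return self.head(out)  # (B, T_DENSE, D_3D): dense 3D pose sequence

poses_2d = torch.randn(2, T_DENSE // STRIDE, D_2D)
dense_3d = Upsampler()(poses_2d)  # (2, 32, 51)
```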