PoseBERT: A Generic Transformer Module for Temporal 3D Human Modeling
- URL: http://arxiv.org/abs/2208.10211v1
- Date: Mon, 22 Aug 2022 11:30:14 GMT
- Title: PoseBERT: A Generic Transformer Module for Temporal 3D Human Modeling
- Authors: Fabien Baradel, Romain Brégier, Thibault Groueix, Philippe
Weinzaepfel, Yannis Kalantidis, Grégory Rogez
- Abstract summary: PoseBERT is a transformer module that is fully trained on 3D Motion Capture data via masked modeling.
It is simple, generic and versatile, as it can be plugged on top of any image-based model to transform it into a video-based model.
Our experimental results validate that adding PoseBERT on top of various state-of-the-art pose estimation methods consistently improves their performance.
- Score: 23.420076136028687
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training state-of-the-art models for human pose estimation in
videos requires datasets with annotations that are hard and expensive to obtain.
Although transformers have been recently utilized for body pose sequence
modeling, related methods rely on pseudo-ground truth to augment the currently
limited training data available for learning such models. In this paper, we
introduce PoseBERT, a transformer module that is fully trained on 3D Motion
Capture (MoCap) data via masked modeling. It is simple, generic and versatile,
as it can be plugged on top of any image-based model to transform it into a
video-based model leveraging temporal information. We showcase variants of
PoseBERT with different inputs varying from 3D skeleton keypoints to rotations
of a 3D parametric model for either the full body (SMPL) or just the hands
(MANO). Since PoseBERT training is task agnostic, the model can be applied to
several tasks such as pose refinement, future pose prediction or motion
completion without finetuning. Our experimental results validate that adding
PoseBERT on top of various state-of-the-art pose estimation methods
consistently improves their performance, while its low computational cost
allows us to use it in a real-time demo for smoothly animating a robotic hand
via a webcam. Test code and models are available at
https://github.com/naver/posebert.
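To make the masked-modeling idea concrete, below is a minimal, hypothetical PyTorch sketch of training a transformer on MoCap pose sequences by reconstructing randomly masked frames. All names, dimensions and hyperparameters (PoseDenoiser, the 24-joint 6D-rotation input, the 30% masking ratio) are illustrative assumptions, not the actual code from https://github.com/naver/posebert.

```python
# Minimal sketch of BERT-style masked modeling on pose sequences.
# All names and sizes are illustrative, not the naver/posebert API.
import torch
import torch.nn as nn

D_POSE = 24 * 6   # assumption: SMPL's 24 joints in a 6D rotation representation
D_MODEL = 256
SEQ_LEN = 16

class PoseDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(D_POSE, D_MODEL)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, D_MODEL))
        self.pos = nn.Parameter(torch.zeros(1, SEQ_LEN, D_MODEL))
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(D_MODEL, D_POSE)

    def forward(self, poses, mask):
        # poses: (B, T, D_POSE) per-frame pose parameters.
        # mask:  (B, T) bool, True where a frame is hidden and must be filled in.
        x = self.embed(poses)
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        x = self.encoder(x + self.pos)
        return self.head(x)  # predicts the full sequence

# One training step on MoCap sequences (no images involved):
model = PoseDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
gt = torch.randn(8, SEQ_LEN, D_POSE)     # stand-in for real MoCap poses
mask = torch.rand(8, SEQ_LEN) < 0.3      # hide ~30% of the frames
opt.zero_grad()
pred = model(gt, mask)
loss = (pred - gt)[mask].abs().mean()    # supervise only the masked frames
loss.backward()
opt.step()
```

Because this training objective is task agnostic, the masking pattern at inference can be matched to the task the abstract lists: masking trailing frames yields future pose prediction, masking a middle span yields motion completion, and feeding noisy per-frame estimates with no mask could act as pose refinement.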
Related papers
- HeadCraft: Modeling High-Detail Shape Variations for Animated 3DMMs [9.790185628415301]
We introduce a generative model for detailed 3D head meshes on top of an articulated 3DMM.
We train a StyleGAN model to generalize over UV displacement maps.
We demonstrate results for unconditional generation and for fitting to full or partial observations.
arXiv Detail & Related papers (2023-12-21T18:57:52Z) - Ponymation: Learning 3D Animal Motions from Unlabeled Online Videos [50.83155160955368]
We introduce a new method for learning a generative model of articulated 3D animal motions from raw, unlabeled online videos.
Our model does not require any pose annotations or shape models for training, and is learned purely from a collection of raw video clips obtained from the Internet.
arXiv Detail & Related papers (2023-12-21T06:44:18Z) - Uplift and Upsample: Efficient 3D Human Pose Estimation with Uplifting
Transformers [28.586258731448687]
We present a Transformer-based pose uplifting scheme that can operate on temporally sparse 2D pose sequences.
We show how masked token modeling can be utilized for temporal upsampling within Transformer blocks (a sketch of this idea appears after this list).
We evaluate our method on two popular benchmark datasets: Human3.6M and MPI-INF-3DHP.
arXiv Detail & Related papers (2022-10-12T12:00:56Z) - T3VIP: Transformation-based 3D Video Prediction [49.178585201673364]
We propose a 3D video prediction (T3VIP) approach that explicitly models the 3D motion by decomposing a scene into its object parts.
Our model is fully unsupervised, captures the nature of the real world, and learns from observational cues in the image and point-cloud domains.
To the best of our knowledge, our model is the first generative model that provides an RGB-D video prediction of the future for a static camera.
arXiv Detail & Related papers (2022-09-19T15:01:09Z) - BANMo: Building Animatable 3D Neural Models from Many Casual Videos [135.64291166057373]
We present BANMo, a method that requires neither a specialized sensor nor a pre-defined template shape.
BANMo builds high-fidelity, articulated 3D models from many monocular casual videos in a differentiable rendering framework.
On real and synthetic datasets, BANMo shows higher-fidelity 3D reconstructions than prior works for humans and animals.
arXiv Detail & Related papers (2021-12-23T18:30:31Z) - Human Performance Capture from Monocular Video in the Wild [50.34917313325813]
We propose a method capable of capturing the dynamic 3D human shape from a monocular video featuring challenging body poses.
Our method outperforms state-of-the-art methods on the in-the-wild human video dataset 3DPW.
arXiv Detail & Related papers (2021-11-29T16:32:41Z) - Leveraging MoCap Data for Human Mesh Recovery [27.76352018682937]
We study whether poses from 3D Motion Capture (MoCap) data can be used to improve image-based and video-based human mesh recovery methods.
We find that fine-tuning image-based models with synthetic renderings from MoCap data can increase their performance.
We introduce PoseBERT, a transformer module that directly regresses the pose parameters and is trained via masked modeling.
arXiv Detail & Related papers (2021-10-18T12:43:00Z) - Vid2Actor: Free-viewpoint Animatable Person Synthesis from Video in the
Wild [22.881898195409885]
Given an "in-the-wild" video of a person, we reconstruct an animatable model of the person in the video.
The output model can be rendered in any body pose to any camera view, via the learned controls, without explicit 3D mesh reconstruction.
arXiv Detail & Related papers (2020-12-23T18:50:42Z) - Self-Supervised 3D Human Pose Estimation via Part Guided Novel Image
Synthesis [72.34794624243281]
We propose a self-supervised learning framework to disentangle variations from unlabeled video frames.
Our differentiable formalization, bridging the representation gap between the 3D pose and spatial part maps, allows us to operate on videos with diverse camera movements.
arXiv Detail & Related papers (2020-04-09T07:55:01Z) - Monocular Real-time Hand Shape and Motion Capture using Multi-modal Data [77.34069717612493]
We present a novel method for monocular hand shape and pose estimation at an unprecedented runtime performance of 100 fps.
This is enabled by a new learning based architecture designed such that it can make use of all the sources of available hand training data.
It features a 3D hand joint detection module and an inverse kinematics module which regresses not only 3D joint positions but also maps them to joint rotations in a single feed-forward pass.
arXiv Detail & Related papers (2020-03-21T03:51:54Z)
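As referenced in the "Uplift and Upsample" entry above, here is a hedged sketch of masked-token temporal upsampling: 2D poses observed only at every STRIDE-th frame are embedded, learned mask tokens fill the remaining timesteps, and a transformer predicts a dense 3D sequence. The class name, joint count and tensor shapes are assumptions based solely on the abstract, not that paper's implementation.

```python
# Hedged sketch of masked-token temporal upsampling; names/shapes assumed.
import torch
import torch.nn as nn

D_2D, D_3D = 17 * 2, 17 * 3      # assumption: 17 joints, 2D in / 3D out
D_MODEL, T_DENSE, STRIDE = 256, 32, 4

class Upsampler(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(D_2D, D_MODEL)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, D_MODEL))
        self.pos = nn.Parameter(torch.zeros(1, T_DENSE, D_MODEL))
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(D_MODEL, D_3D)

    def forward(self, sparse_2d):
        # sparse_2d: (B, T_DENSE // STRIDE, D_2D), 2D poses at every STRIDE-th frame.
        b = sparse_2d.shape[0]
        tokens = self.mask_token.expand(b, T_DENSE, -1).clone()
        tokens[:, ::STRIDE] = self.embed(sparse_2d)  # keyframes get real embeddings
        out = self.encoder(tokens + self.pos)
        return self.head(out)  # (B, T_DENSE, D_3D): dense 3D pose sequence

poses_2d = torch.randn(2, T_DENSE // STRIDE, D_2D)
dense_3d = Upsampler()(poses_2d)  # (2, 32, 51)
```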