Towards Robust and Smooth 3D Multi-Person Pose Estimation from Monocular
Videos in the Wild
- URL: http://arxiv.org/abs/2309.08644v1
- Date: Fri, 15 Sep 2023 06:17:22 GMT
- Title: Towards Robust and Smooth 3D Multi-Person Pose Estimation from Monocular
Videos in the Wild
- Authors: Sungchan Park, Eunyi You, Inhoe Lee, Joonseok Lee
- Abstract summary: POTR-3D is a sequence-to-sequence 2D-to-3D lifting model for 3DMPPE.
It generalizes robustly to diverse unseen views, recovers poses under heavy occlusion, and reliably produces more natural and smoother outputs.
- Score: 10.849750765175754
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D pose estimation is an invaluable task in computer vision with various
practical applications. In particular, 3D multi-person pose estimation from a
monocular video (3DMPPE) is especially challenging and still largely uncharted,
far from being applicable to in-the-wild scenarios. We identify three unresolved
issues with existing methods: lack of robustness to views unseen during
training, vulnerability to occlusion, and severe jittering in the output. As a
remedy, we propose POTR-3D, the first realization of a sequence-to-sequence
2D-to-3D lifting model for 3DMPPE, powered by a novel geometry-aware data
augmentation strategy capable of generating unbounded data with a variety of
views while accounting for the ground plane and occlusions.
Through extensive experiments, we verify that the proposed model and data
augmentation robustly generalize to diverse unseen views, recover poses under
heavy occlusion, and produce more natural and smoother outputs. The
effectiveness of our approach is verified not only by state-of-the-art
performance on public benchmarks, but also by qualitative results on more
challenging in-the-wild videos. Demo videos are available at
https://www.youtube.com/@potr3d.
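The sequence-to-sequence lifting formulation can be sketched in shape terms: a whole 2D keypoint sequence goes in and the whole 3D sequence comes out in one pass, rather than frame by frame. Everything below (the 27-frame window, 17 joints, and the toy random-weight two-layer network) is an illustrative assumption, not POTR-3D's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

T, N, J = 27, 2, 17  # frames, people, joints (illustrative choices)
# Toy random weights standing in for a trained lifting network.
W1 = rng.normal(0, 0.01, (T * J * 2, 256))
W2 = rng.normal(0, 0.01, (256, T * J * 3))

def lift_sequence(kp2d: np.ndarray) -> np.ndarray:
    """Map one person's full 2D keypoint sequence (T, J, 2) to the full
    3D sequence (T, J, 3) in a single pass (sequence-to-sequence),
    instead of lifting each frame independently."""
    h = np.tanh(kp2d.reshape(-1) @ W1)
    return (h @ W2).reshape(T, J, 3)

# Per-person normalized 2D detections for the whole clip.
poses_2d = rng.uniform(0, 1, (N, T, J, 2))
poses_3d = np.stack([lift_sequence(p) for p in poses_2d])
print(poses_3d.shape)  # (2, 27, 17, 3)
```

Because the model sees the whole temporal window at once, temporal smoothness is learned jointly with the lifting, which is what the abstract's jittering claim hinges on.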
Related papers
- UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues [55.69339788566899]
UPose3D is a novel approach for multi-view 3D human pose estimation.
It improves robustness and flexibility without requiring direct 3D annotations.
arXiv Detail & Related papers (2024-04-23T00:18:00Z)
- Two Views Are Better than One: Monocular 3D Pose Estimation with Multiview Consistency [0.493599216374976]
We propose a novel loss function, multiview consistency, to enable adding additional training data with only 2D supervision.
Our experiments demonstrate that two views offset by 90 degrees are enough to obtain good performance, with only marginal improvements by adding more views.
This research introduces new possibilities for domain adaptation in 3D pose estimation, providing a practical and cost-effective solution to customize models for specific applications.
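The multiview consistency idea can be sketched as a loss that rotates one camera's root-centered 3D prediction into the other camera's frame using the known relative rotation (here the 90-degree offset the experiments mention) and penalizes the disagreement. The 17-joint poses and mean-centering below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def multiview_consistency_loss(pose_a, pose_b, R_ab):
    """Consistency between per-view 3D predictions: center each
    prediction, rotate view A's pose into view B's frame with the
    known relative rotation R_ab, and penalize the difference."""
    ca = pose_a - pose_a.mean(axis=0)  # root/center-align each view
    cb = pose_b - pose_b.mean(axis=0)
    return float(np.mean((ca @ R_ab.T - cb) ** 2))

# 90-degree rotation about the y (vertical) axis.
R = np.array([[0.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [-1.0, 0.0, 0.0]])

rng = np.random.default_rng(0)
pose_view_a = rng.normal(size=(17, 3))  # hypothetical 17-joint prediction
# A perfectly consistent second-view prediction for demonstration.
pose_view_b = (pose_view_a - pose_view_a.mean(axis=0)) @ R.T
print(multiview_consistency_loss(pose_view_a, pose_view_b, R))  # close to 0
```

Only the relative camera rotation is needed, which is why such a loss lets 2D-only multi-view footage act as extra supervision without 3D labels.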
arXiv Detail & Related papers (2023-11-21T08:21:55Z)
- On Triangulation as a Form of Self-Supervision for 3D Human Pose Estimation [57.766049538913926]
Supervised approaches to 3D pose estimation from single images are remarkably effective when labeled data is abundant.
Much of the recent attention has shifted towards semi- and weakly-supervised learning.
We propose to impose multi-view geometric constraints by means of differentiable triangulation and to use it as a form of self-supervision during training when no labels are available.
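Differentiable triangulation is commonly realized as the linear (DLT) least-squares solve, which stays differentiable through the SVD. A minimal two-view sketch with toy cameras (an assumption for illustration, not this paper's setup):

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one joint from two views.
    P1, P2: 3x4 camera projection matrices; x1, x2: 2D pixel coords.
    Each view contributes two rows of the homogeneous system A X = 0;
    the 3D point is the null vector of A (last row of Vt in the SVD)."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # dehomogenize

def project(P, X):
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Two toy cameras: identity, and one translated along x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.2, 4.0])

X_hat = triangulate(P1, P2, project(P1, X_true), project(P2, X_true))
print(np.allclose(X_hat, X_true))  # True
```

Since the SVD is differentiable, the triangulated point can supervise per-view 2D or 3D predictions by backpropagating a reprojection-style loss through it.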
arXiv Detail & Related papers (2022-03-29T19:11:54Z)
- MetaPose: Fast 3D Pose from Multiple Views without 3D Supervision [72.5863451123577]
We show how to train a neural model that can perform accurate 3D pose and camera estimation.
Our method outperforms both classical bundle adjustment and weakly-supervised monocular 3D baselines.
arXiv Detail & Related papers (2021-08-10T18:39:56Z)
- Exploring Severe Occlusion: Multi-Person 3D Pose Estimation with Gated Convolution [34.301501457959056]
We propose a temporal regression network with a gated convolution module to transform 2D joints to 3D.
A simple yet effective localization approach is also conducted to transform the normalized pose to the global trajectory.
Our proposed method outperforms most state-of-the-art 2D-to-3D pose estimation methods.
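A gated convolution, in the generic sense, multiplies a convolution's activation by a learned sigmoid gate, letting the network down-weight frames whose 2D detections are unreliable (e.g. under occlusion). The shapes and the plain-NumPy loop below are an illustrative sketch, not this paper's implementation:

```python
import numpy as np

def gated_temporal_conv(x, W_f, W_g):
    """Gated 1D temporal convolution over a joint-coordinate sequence:
    output[t] = tanh(conv(x, W_f))[t] * sigmoid(conv(x, W_g))[t].
    x: (T, C) sequence; W_f, W_g: (k, C, C_out) filter and gate weights."""
    k, _, c_out = W_f.shape
    T = x.shape[0] - k + 1  # valid (unpadded) output length
    out = np.empty((T, c_out))
    for t in range(T):
        win = x[t:t + k]  # (k, C) temporal window
        f = np.tensordot(win, W_f, axes=([0, 1], [0, 1]))
        g = np.tensordot(win, W_g, axes=([0, 1], [0, 1]))
        out[t] = np.tanh(f) * (1.0 / (1.0 + np.exp(-g)))  # gated activation
    return out

rng = np.random.default_rng(0)
seq = rng.normal(size=(27, 34))        # 27 frames of 17 joints x (x, y)
Wf = rng.normal(0, 0.1, (3, 34, 64))   # kernel size 3, 64 output channels
Wg = rng.normal(0, 0.1, (3, 34, 64))
feat = gated_temporal_conv(seq, Wf, Wg)
print(feat.shape)  # (25, 64)
```

The gate keeps every output in (-1, 1), and a gate near zero effectively masks a corrupted frame out of the temporal regression.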
arXiv Detail & Related papers (2020-10-31T04:35:24Z)
- Self-Supervised 3D Human Pose Estimation via Part Guided Novel Image Synthesis [72.34794624243281]
We propose a self-supervised learning framework to disentangle variations from unlabeled video frames.
Our differentiable formalization, bridging the representation gap between the 3D pose and spatial part maps, allows us to operate on videos with diverse camera movements.
arXiv Detail & Related papers (2020-04-09T07:55:01Z)
- Monocular Real-time Hand Shape and Motion Capture using Multi-modal Data [77.34069717612493]
We present a novel method for monocular hand shape and pose estimation at an unprecedented runtime of 100 fps.
This is enabled by a new learning-based architecture designed to make use of all available sources of hand training data.
It features a 3D hand joint detection module and an inverse kinematics module, which regresses not only 3D joint positions but also maps them to joint rotations in a single feed-forward pass.
arXiv Detail & Related papers (2020-03-21T03:51:54Z)
- Weakly-Supervised 3D Human Pose Learning via Multi-view Images in the Wild [101.70320427145388]
We propose a weakly-supervised approach that does not require 3D annotations and learns to estimate 3D poses from unlabeled multi-view data.
We evaluate our proposed approach on two large scale datasets.
arXiv Detail & Related papers (2020-03-17T08:47:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.