Human Pose Estimation in Monocular Omnidirectional Top-View Images
- URL: http://arxiv.org/abs/2304.08186v1
- Date: Mon, 17 Apr 2023 11:52:04 GMT
- Title: Human Pose Estimation in Monocular Omnidirectional Top-View Images
- Authors: Jingrui Yu, Tobias Scheck, Roman Seidel, Yukti Adya, Dipankar Nandi, Gangolf Hirtz
- Abstract summary: We propose a new dataset for training and evaluation of CNNs for the task of keypoint detection in omnidirectional images.
The training dataset, THEODORE+, consists of 50,000 images and is created by a 3D rendering engine.
For evaluation purposes, the real-world PoseFES dataset with two scenarios and 701 frames with up to eight persons per scene was captured and annotated.
- Score: 3.07869141026886
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human pose estimation (HPE) with convolutional neural networks (CNNs) for
indoor monitoring is one of the major challenges in computer vision. In
contrast to HPE in perspective views, an indoor monitoring system can consist
of an omnidirectional camera with a field of view of 180° to detect the
pose of a person with only one sensor per room. To recognize human pose, the
detection of keypoints is an essential upstream step. In our work, we propose a
new dataset for training and evaluation of CNNs for the task of keypoint
detection in omnidirectional images. The training dataset, THEODORE+, consists
of 50,000 images and is created by a 3D rendering engine, where humans are
randomly walking through an indoor environment. In a dynamically created 3D
scene, persons move randomly while the omnidirectional camera moves
simultaneously, generating synthetic RGB images together with 2D and 3D ground
truth. For evaluation
purposes, the real-world PoseFES dataset with two scenarios and 701 frames with
up to eight persons per scene was captured and annotated. We propose four
training paradigms to finetune or re-train two top-down models in MMPose and
two bottom-up models in CenterNet on THEODORE+. Besides a qualitative evaluation,
we report quantitative results. Compared to a COCO-pretrained baseline, we
achieve significant improvements especially for top-view scenes on the PoseFES
dataset. Our datasets can be found at
https://www.tu-chemnitz.de/etit/dst/forschung/comp_vision/datasets/index.php.en.
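The abstract does not say which lens model the rendering engine uses for the 180° omnidirectional camera. As a minimal sketch, assuming an ideal equidistant fisheye model (image radius r = f·θ, where θ is the angle between a viewing ray and the optical axis) and a 3D keypoint already expressed in camera coordinates, 2D keypoint ground truth could be derived from 3D ground truth as follows. The function `project_equidistant` and its parameters `f`, `cx`, `cy` are hypothetical and not taken from THEODORE+.

```python
import math
from typing import Optional, Tuple


def project_equidistant(
    point: Tuple[float, float, float],
    f: float,
    cx: float,
    cy: float,
    max_theta: float = math.pi / 2,
) -> Optional[Tuple[float, float]]:
    """Project a camera-frame 3D point with an ASSUMED equidistant fisheye
    model (r = f * theta). Z is the optical axis; returns None when the
    point lies outside the field of view (half-angle max_theta)."""
    x, y, z = point
    rho = math.hypot(x, y)        # distance of the point from the optical axis
    theta = math.atan2(rho, z)    # angle between the viewing ray and the axis
    if theta > max_theta:
        return None               # outside a 180 deg lens (half-angle 90 deg)
    phi = math.atan2(y, x)        # azimuth of the ray around the axis
    r = f * theta                 # equidistant mapping: radius grows linearly
    return cx + r * math.cos(phi), cy + r * math.sin(phi)


# Hypothetical usage: project a few 3D keypoints (metres, camera frame).
keypoints_3d = [(0.0, 0.0, 2.5), (0.4, -0.2, 1.8), (1.0, 0.0, 1.0)]
keypoints_2d = [project_equidistant(p, f=320.0, cx=512.0, cy=512.0)
                for p in keypoints_3d]
```

Under this assumed model, a person standing directly below a ceiling-mounted camera projects near the image center, while points close to the horizon approach the image border at r = f·π/2, which illustrates how strongly top-view fisheye poses can differ from the perspective views found in COCO.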
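The abstract compares against a COCO-pretrained baseline without naming the metric in this summary. COCO keypoint results are conventionally reported as average precision over object keypoint similarity (OKS) thresholds, so, assuming a COCO-style evaluation on PoseFES, the per-person score could be sketched as below. The function `oks` and its argument layout are hypothetical; in the official COCO evaluator, s² is the annotated object area and k_i = 2σ_i with per-keypoint σ_i.

```python
import math
from typing import Sequence, Tuple


def oks(
    pred: Sequence[Tuple[float, float]],
    gt: Sequence[Tuple[float, float]],
    visible: Sequence[bool],
    area: float,
    k: Sequence[float],
) -> float:
    """COCO-style object keypoint similarity for one person (sketch).

    area plays the role of the object scale s^2 and k holds the per-keypoint
    falloff constants k_i; only labeled (visible) keypoints are averaged.
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), vis, ki in zip(pred, gt, visible, k):
        if not vis:
            continue  # unlabeled ground-truth keypoints are ignored
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        total += math.exp(-d2 / (2.0 * area * ki ** 2))
        count += 1
    # Simplification: no labeled keypoints -> 0.0 (official evaluator differs).
    return total / count if count else 0.0
```

A detection counts as a true positive at threshold t when its OKS with the matched ground truth is at least t; COCO averages AP over t = 0.50, 0.55, ..., 0.95.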
Related papers
- CameraHMR: Aligning People with Perspective [54.05758012879385]
We address the challenge of accurate 3D human pose and shape estimation from monocular images.
Existing training datasets containing real images with pseudo ground truth (pGT) use SMPLify to fit SMPL to sparse 2D joint locations.
We make two contributions that improve pGT accuracy.
Date: 2024-11-12T19:12:12Z
- Self-learning Canonical Space for Multi-view 3D Human Pose Estimation [57.969696744428475]
Multi-view 3D human pose estimation is naturally superior to its single-view counterpart.
However, accurate annotations of such information are hard to obtain.
We propose a fully self-supervised framework, named cascaded multi-view aggregating network (CMANet).
CMANet is superior to state-of-the-art methods in extensive quantitative and qualitative analysis.
Date: 2024-03-19T04:54:59Z
- Learning to Estimate 3D Human Pose from Point Cloud [13.27496851711973]
We propose a deep human pose network for 3D pose estimation that takes point cloud data as input to model the surface of complex human structures.
Our experiments on two public datasets show that our approach achieves higher accuracy than previous state-of-the-art methods.
Date: 2022-12-25T14:22:01Z
- Embodied Scene-aware Human Pose Estimation [25.094152307452]
We propose embodied scene-aware human pose estimation.
Our method is one stage, causal, and recovers global 3D human poses in a simulated environment.
Date: 2022-06-18T03:50:19Z
- Simple and Effective Synthesis of Indoor 3D Scenes [78.95697556834536]
We study the problem of synthesizing immersive 3D indoor scenes from one or more images.
Our aim is to generate high-resolution images and videos from novel viewpoints.
We propose an image-to-image GAN that maps directly from reprojections of incomplete point clouds to full high-resolution RGB-D images.
Date: 2022-04-06T17:54:46Z
- Learning Temporal 3D Human Pose Estimation with Pseudo-Labels [3.0954251281114513]
We present a simple, yet effective, approach for self-supervised 3D human pose estimation.
We rely on triangulating 2D body pose estimates of a multiple-view camera system.
Our method achieves state-of-the-art performance in the Human3.6M and MPI-INF-3DHP benchmarks.
Date: 2021-10-14T17:40:45Z
- Self-Supervised 3D Human Pose Estimation with Multiple-View Geometry [2.7541825072548805]
We present a self-supervised learning algorithm for 3D human pose estimation of a single person based on a multiple-view camera system.
We propose a four-loss function learning algorithm, which does not require any 2D or 3D body pose ground-truth.
Date: 2021-08-17T17:31:24Z
- Pose2Mesh: Graph Convolutional Network for 3D Human Pose and Mesh Recovery from a 2D Human Pose [70.23652933572647]
We propose a novel graph convolutional neural network (GraphCNN)-based system that estimates the 3D coordinates of human mesh vertices directly from the 2D human pose.
We show that our Pose2Mesh outperforms the previous 3D human pose and mesh estimation methods on various benchmark datasets.
Date: 2020-08-20T16:01:56Z
- HDNet: Human Depth Estimation for Multi-Person Camera-Space Localization [83.57863764231655]
We propose the Human Depth Estimation Network (HDNet), an end-to-end framework for absolute root joint localization.
A skeleton-based Graph Neural Network (GNN) is utilized to propagate features among joints.
We evaluate our HDNet on the root joint localization and root-relative 3D pose estimation tasks with two benchmark datasets.
Date: 2020-07-17T12:44:23Z
- Self-Supervised 3D Human Pose Estimation via Part Guided Novel Image Synthesis [72.34794624243281]
We propose a self-supervised learning framework to disentangle variations from unlabeled video frames.
Our differentiable formalization, bridging the representation gap between the 3D pose and spatial part maps, allows us to operate on videos with diverse camera movements.
Date: 2020-04-09T07:55:01Z
This list is automatically generated from the titles and abstracts of the papers on this site.