VoxelKeypointFusion: Generalizable Multi-View Multi-Person Pose Estimation
- URL: http://arxiv.org/abs/2410.18723v1
- Date: Thu, 24 Oct 2024 13:28:40 GMT
- Title: VoxelKeypointFusion: Generalizable Multi-View Multi-Person Pose Estimation
- Authors: Daniel Bermuth, Alexander Poeppel, Wolfgang Reif
- Abstract summary: This work presents an evaluation of the generalization capabilities of multi-view multi-person pose estimators to unseen datasets.
It also studies the improvements gained by additionally using depth information.
Since the new approach generalizes well not only to unseen datasets but also to different keypoints, the first multi-view multi-person whole-body estimator is presented.
- Score: 45.085830389820956
- Abstract: In the rapidly evolving field of computer vision, accurately estimating the poses of multiple individuals from various viewpoints presents a formidable challenge, especially when the estimates must also be reliable. This work presents an extensive evaluation of the generalization capabilities of multi-view multi-person pose estimators to unseen datasets and introduces a new algorithm with strong performance on this task. It also studies the improvements gained by additionally using depth information. Since the new approach generalizes well not only to unseen datasets but also to different keypoints, the first multi-view multi-person whole-body estimator is presented. To support further research on these topics, all of the work is publicly accessible.
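The core idea named in the title can be sketched briefly: 2D keypoint confidence maps from several calibrated cameras are projected into a shared voxel grid, the per-view confidences are fused, and 3D keypoints are read off as the highest-scoring voxels. The following is a minimal illustration of that fusion step under an assumed pinhole calibration; the function and parameter names are hypothetical and this is not the authors' implementation.

```python
import numpy as np

def fuse_keypoints_into_voxels(heatmaps, projections, grid_min, grid_max, resolution):
    """Fuse per-view 2D keypoint heatmaps into a shared 3D voxel grid.

    heatmaps:    list of (H, W) confidence maps, one per camera view
    projections: list of (3, 4) projection matrices P = K [R | t]
    grid_min, grid_max: (3,) world-space bounds of the capture volume
    resolution:  number of voxels along each axis
    """
    # Voxel centre coordinates in homogeneous world space: (R, R, R, 4).
    axes = [np.linspace(grid_min[d], grid_max[d], resolution) for d in range(3)]
    xs, ys, zs = np.meshgrid(*axes, indexing="ij")
    points = np.stack([xs, ys, zs, np.ones_like(xs)], axis=-1)

    fused = np.zeros((resolution,) * 3)
    for heatmap, P in zip(heatmaps, projections):
        h, w = heatmap.shape
        uvw = points @ P.T                       # project voxel centres into the view
        u = uvw[..., 0] / uvw[..., 2]            # pixel column
        v = uvw[..., 1] / uvw[..., 2]            # pixel row
        # Only voxels in front of the camera and inside the image contribute.
        valid = (uvw[..., 2] > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        ui = np.clip(u.astype(int), 0, w - 1)
        vi = np.clip(v.astype(int), 0, h - 1)
        fused += np.where(valid, heatmap[vi, ui], 0.0)

    fused /= len(heatmaps)                       # average confidence across views
    peak = np.unravel_index(np.argmax(fused), fused.shape)
    return fused, peak                           # fused grid and best voxel index
```

In a multi-person setting the single argmax would be replaced by non-maximum suppression over the grid, with the fusion repeated per keypoint type.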
Related papers
- SimpleDepthPose: Fast and Reliable Human Pose Estimation with RGBD-Images [45.085830389820956]
This paper introduces a novel algorithm that excels in multi-view, multi-person pose estimation by incorporating depth information.
An extensive evaluation demonstrates that the proposed algorithm not only generalizes well to unseen datasets and offers fast runtime performance, but is also adaptable to different keypoints.
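The role depth plays here is easy to illustrate: with a depth map registered to the colour image, a detected 2D keypoint can be lifted to a 3D point in camera space directly, without multi-view triangulation. A minimal sketch under assumed pinhole intrinsics (the function name is hypothetical, not taken from the paper):

```python
import numpy as np

def backproject_keypoint(u, v, depth, K):
    """Lift a 2D keypoint (u, v) into camera space using an aligned depth map.

    depth: (H, W) depth image in metres, registered to the colour image
    K:     3x3 pinhole intrinsics matrix
    """
    z = depth[int(round(v)), int(round(u))]   # metric depth at the keypoint
    x = (u - K[0, 2]) * z / K[0, 0]           # (u - cx) * z / fx
    y = (v - K[1, 2]) * z / K[1, 1]           # (v - cy) * z / fy
    return np.array([x, y, z])
```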
arXiv Detail & Related papers (2025-01-30T16:51:40Z)
- MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
We present a comprehensive dataset compiled from Nature Communications articles covering 72 scientific fields.
We evaluated 19 proprietary and open-source models on two benchmark tasks, figure captioning and multiple-choice question answering, and conducted human expert annotation.
Fine-tuning Qwen2-VL-7B with our task-specific data achieved better performance than GPT-4o and even human experts in multiple-choice evaluations.
arXiv Detail & Related papers (2024-07-06T00:40:53Z)
- You Only Learn One Query: Learning Unified Human Query for Single-Stage Multi-Person Multi-Task Human-Centric Perception [37.667147915777534]
Human-centric perception is a long-standing problem for computer vision.
This paper introduces a unified and versatile framework (HQNet) for single-stage multi-person multi-task human-centric perception (HCP).
Human Query captures intricate instance-level features for individual persons and disentangles complex multi-person scenarios.
arXiv Detail & Related papers (2023-12-09T10:36:43Z)
- HaMuCo: Hand Pose Estimation via Multiview Collaborative Self-Supervised Learning [19.432034725468217]
HaMuCo is a self-supervised learning framework that learns a single-view hand pose estimator from multi-view pseudo 2D labels.
We introduce a cross-view interaction network that distills the single-view estimator by utilizing the cross-view correlated features.
Our method can achieve state-of-the-art performance on multi-view self-supervised hand pose estimation.
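The cross-view interaction network itself is not reproduced here, but the multi-view pseudo-labelling it builds on follows a well-known pattern: noisy per-view 2D predictions are triangulated into a single consistent 3D point, which is reprojected to supervise each single-view estimator. A generic sketch of that pattern (not the paper's architecture), assuming calibrated cameras:

```python
import numpy as np

def triangulate_pseudo_labels(points_2d, projections):
    """Triangulate one keypoint from several views (linear DLT), then
    reproject it to obtain refined per-view pseudo 2D labels.

    points_2d:   (V, 2) noisy per-view 2D predictions for one keypoint
    projections: list of V (3, 4) camera projection matrices
    """
    # Each view contributes two rows to the homogeneous system A X = 0.
    rows = []
    for (u, v), P in zip(points_2d, projections):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    # Least-squares solution: right singular vector of the smallest singular value.
    X = np.linalg.svd(np.stack(rows))[2][-1]
    X /= X[3]                                  # de-homogenise

    # Reprojections become the refined pseudo labels for each view.
    pseudo = [(P @ X)[:2] / (P @ X)[2] for P in projections]
    return X[:3], np.stack(pseudo)
```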
arXiv Detail & Related papers (2023-02-02T10:13:04Z)
- Two-level Data Augmentation for Calibrated Multi-view Detection [51.5746691103591]
We introduce a new multi-view data augmentation pipeline that preserves alignment among views.
We also propose a second level of augmentation applied directly at the scene level.
When combined with our simple multi-view detection model, our two-level augmentation pipeline outperforms all existing baselines.
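The alignment-preserving constraint can be made concrete: if a view is warped by an affine transform A in pixel space, composing A with that camera's projection matrix keeps 2D detections and 3D geometry consistent, since u' ~ A u and u ~ P X imply u' ~ (A P) X. A minimal sketch using OpenCV (hypothetical helper, not the paper's pipeline):

```python
import numpy as np
import cv2

def augment_view(image, P, angle_deg, scale):
    """Rotate/scale one view while keeping its camera geometry consistent."""
    h, w = image.shape[:2]
    # 2x3 affine warp around the image centre.
    A2x3 = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, scale)
    warped = cv2.warpAffine(image, A2x3, (w, h))
    # Lift to 3x3 so it composes with the (3, 4) projection matrix.
    A = np.vstack([A2x3, [0.0, 0.0, 1.0]])
    return warped, A @ P                       # warped image, updated camera
```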
arXiv Detail & Related papers (2022-10-19T17:55:13Z)
- Multi-View representation learning in Multi-Task Scene [4.509968166110557]
We propose a novel semi-supervised algorithm, termed Multi-Task Multi-View learning based on Common and Special Features (MTMVCSF).
An anti-noise multi-task multi-view algorithm called AN-MTMVCSF is also proposed, which is robust to noisy labels.
The effectiveness of these algorithms is demonstrated by a series of experiments on both real-world and synthetic data.
arXiv Detail & Related papers (2022-01-15T11:26:28Z)
- The Multimodal Sentiment Analysis in Car Reviews (MuSe-CaR) Dataset: Collection, Insights and Improvements [14.707930573950787]
We present MuSe-CaR, a first-of-its-kind multimodal dataset.
The data is publicly available, as it recently served as the test bed for the 1st Multimodal Sentiment Analysis Challenge.
arXiv Detail & Related papers (2021-01-15T10:40:37Z)
- Multi-Domain Adversarial Feature Generalization for Person Re-Identification [52.835955258959785]
We propose a multi-dataset feature generalization network (MMFA-AAE).
It is capable of learning a universal domain-invariant feature representation from multiple labeled datasets and generalizing it to unseen camera systems.
It also surpasses many state-of-the-art supervised methods and unsupervised domain adaptation methods by a large margin.
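One common mechanism for this kind of multi-dataset domain-invariant feature learning is a domain classifier trained through a gradient-reversal layer; whether MMFA-AAE uses exactly this component is not stated in the summary, so the sketch below is a generic illustration, not the paper's architecture:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None

class DomainAdversarialHead(nn.Module):
    """Predicts which source dataset a feature came from; the reversed
    gradient pushes the encoder toward dataset-invariant features."""
    def __init__(self, feat_dim, num_domains, lam=1.0):
        super().__init__()
        self.lam = lam
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, num_domains))

    def forward(self, features):
        return self.classifier(GradReverse.apply(features, self.lam))
```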
arXiv Detail & Related papers (2020-11-25T08:03:15Z)
- Deep Learning for Person Re-identification: A Survey and Outlook [233.36948173686602]
Person re-identification (Re-ID) aims at retrieving a person of interest across multiple non-overlapping cameras.
By dissecting the involved components in developing a person Re-ID system, we categorize it into the closed-world and open-world settings.
arXiv Detail & Related papers (2020-01-13T12:49:22Z)