HaMuCo: Hand Pose Estimation via Multiview Collaborative Self-Supervised Learning
- URL: http://arxiv.org/abs/2302.00988v2
- Date: Tue, 15 Aug 2023 04:51:27 GMT
- Title: HaMuCo: Hand Pose Estimation via Multiview Collaborative Self-Supervised Learning
- Authors: Xiaozheng Zheng, Chao Wen, Zhou Xue, Pengfei Ren, Jingyu Wang
- Abstract summary: HaMuCo is a self-supervised learning framework that learns a single-view hand pose estimator from multi-view pseudo 2D labels.
We introduce a cross-view interaction network that distills the single-view estimator by utilizing the cross-view correlated features.
Our method can achieve state-of-the-art performance on multi-view self-supervised hand pose estimation.
- Score: 19.432034725468217
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in 3D hand pose estimation have shown promising results,
but their effectiveness has relied primarily on the availability of large-scale
annotated datasets, the creation of which is a laborious and costly process. To
alleviate the label-hungry limitation, we propose a self-supervised learning
framework, HaMuCo, that learns a single-view hand pose estimator from
multi-view pseudo 2D labels. However, one of the main challenges of
self-supervised learning is the presence of noisy labels and the ``groupthink''
effect from multiple views. To overcome these issues, we introduce a cross-view
interaction network that distills the single-view estimator by utilizing the
cross-view correlated features and enforcing multi-view consistency to achieve
collaborative learning. Both the single-view estimator and the cross-view
interaction network are trained jointly in an end-to-end manner. Extensive
experiments show that our method can achieve state-of-the-art performance on
multi-view self-supervised hand pose estimation. Furthermore, the proposed
cross-view interaction network can also be applied to hand pose estimation from
multi-view input and outperforms previous methods under the same settings.
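The multi-view consistency constraint at the heart of the abstract can be sketched compactly. Below is a minimal, hypothetical PyTorch illustration, not the authors' released code: per-view 3D joint predictions are rotated into a shared canonical frame with known camera rotations, and each view is penalized for deviating from the cross-view average, which stands in for the paper's learned cross-view interaction network.

```python
import torch

def multiview_consistency_loss(joints_per_view, rotations):
    """Hypothetical multi-view consistency penalty.

    joints_per_view: (V, J, 3) per-view 3D joint predictions in camera frames.
    rotations:       (V, 3, 3) camera-to-canonical rotation matrices.
    Returns a scalar: mean distance of each aligned prediction to the
    cross-view consensus (here, a simple average over views).
    """
    # Rotate every view's prediction into the shared canonical frame.
    canonical = torch.einsum('vij,vkj->vki', rotations, joints_per_view)
    # Consensus pose: the average over views, a crude stand-in for the
    # paper's cross-view interaction network.
    consensus = canonical.mean(dim=0, keepdim=True)
    # Penalize per-joint deviation from the consensus.
    return (canonical - consensus).norm(dim=-1).mean()

# Toy usage: 4 views of a 21-joint hand.
views, joints = 4, 21
preds = torch.randn(views, joints, 3, requires_grad=True)
rots = torch.eye(3).repeat(views, 1, 1)  # identity rotations for the toy case
loss = multiview_consistency_loss(preds, rots)
loss.backward()
```

In the actual framework the supervision additionally comes from multi-view pseudo 2D labels and the consensus is produced by the trained interaction network; the averaging above only conveys the shape of the constraint.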
Related papers
- VoxelKeypointFusion: Generalizable Multi-View Multi-Person Pose Estimation [45.085830389820956]
This work presents an evaluation of the generalization capabilities of multi-view multi-person pose estimators to unseen datasets.
It also studies the improvements gained by additionally using depth information.
Since the new approach generalizes well not only to unseen datasets but also to different keypoints, the first multi-view multi-person whole-body estimator is presented.
arXiv Detail & Related papers (2024-10-24T13:28:40Z)
- Disentangled Interaction Representation for One-Stage Human-Object Interaction Detection [70.96299509159981]
Human-Object Interaction (HOI) detection is a core task for human-centric image understanding.
Recent one-stage methods adopt a transformer decoder to collect image-wide cues that are useful for interaction prediction.
Traditional two-stage methods benefit significantly from their ability to compose interaction features in a disentangled and explainable manner.
arXiv Detail & Related papers (2023-12-04T08:02:59Z)
- Semi-supervised learning made simple with self-supervised clustering [65.98152950607707]
Self-supervised learning models have been shown to learn rich visual representations without requiring human annotations.
We propose a conceptually simple yet empirically powerful approach to turn clustering-based self-supervised methods into semi-supervised learners.
arXiv Detail & Related papers (2023-06-13T01:09:18Z)
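To make the clustering-based recipe in the entry above concrete, here is a minimal, hypothetical sketch (scikit-learn style, not the paper's actual method): labeled and unlabeled features are clustered jointly, each cluster inherits the majority label of the labeled points it contains, and the resulting pseudo-labels turn the unlabeled pool into extra training data.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_pseudo_labels(feats_l, y_l, feats_u, n_clusters=10, seed=0):
    """Assign pseudo-labels to unlabeled features via clustering.

    feats_l: (Nl, D) labeled features with integer labels y_l.
    feats_u: (Nu, D) unlabeled features.
    Each cluster takes the majority label of the labeled points that
    fall into it; clusters containing no labeled point stay at -1.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    km.fit(np.vstack([feats_l, feats_u]))
    cl_l = km.labels_[: len(feats_l)]   # cluster ids of labeled points
    cl_u = km.labels_[len(feats_l):]    # cluster ids of unlabeled points
    pseudo = np.full(len(feats_u), -1)  # -1 marks "no pseudo-label"
    for c in range(n_clusters):
        labels_in_c = y_l[cl_l == c]
        if len(labels_in_c) > 0:        # majority vote within the cluster
            pseudo[cl_u == c] = np.bincount(labels_in_c).argmax()
    return pseudo

# Toy usage: 2-D features, 3 classes.
rng = np.random.default_rng(0)
feats_l, y_l = rng.normal(size=(30, 2)), rng.integers(0, 3, size=30)
pseudo = cluster_pseudo_labels(feats_l, y_l, rng.normal(size=(100, 2)), n_clusters=3)
```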
- Cross-view Graph Contrastive Representation Learning on Partially Aligned Multi-view Data [52.491074276133325]
Multi-view representation learning has developed rapidly over the past decades and has been applied in many fields.
We propose a new cross-view graph contrastive learning framework, which integrates multi-view information to align data and learn latent representations.
Experiments conducted on several real datasets demonstrate the effectiveness of the proposed method on the clustering and classification tasks.
arXiv Detail & Related papers (2022-11-08T09:19:32Z)
- Contrastive Learning with Cross-Modal Knowledge Mining for Multimodal Human Activity Recognition [1.869225486385596]
We explore the hypothesis that leveraging multiple modalities can lead to better recognition.
We extend a number of recent contrastive self-supervised approaches to the task of Human Activity Recognition.
We propose a flexible, general-purpose framework for performing multimodal self-supervised learning.
arXiv Detail & Related papers (2022-05-20T10:39:16Z)
- UniVIP: A Unified Framework for Self-Supervised Visual Pre-training [50.87603616476038]
We propose a novel self-supervised framework to learn versatile visual representations on either single-centric-object or non-iconic datasets.
Massive experiments show that UniVIP pre-trained on non-iconic COCO achieves state-of-the-art transfer performance.
Our method can also exploit single-centric-object datasets such as ImageNet and outperforms BYOL by 2.5% in linear probing with the same number of pre-training epochs.
arXiv Detail & Related papers (2022-03-14T10:04:04Z)
- Active Learning with Pseudo-Labels for Multi-View 3D Pose Estimation [18.768030475943213]
We improve Active Learning for the problem of 3D pose estimation in a multi-view setting.
We develop a framework that allows us to efficiently extend existing single-view AL strategies.
We demonstrate additional performance gains by incorporating predicted pseudo-labels, which is a form of self-training.
arXiv Detail & Related papers (2021-12-27T14:34:25Z)
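A compressed, hypothetical sketch of the pseudo-label self-training step mentioned in the active-learning entry above; the classifier, threshold, and classification-style confidence are placeholders (for 3D pose one would score, e.g., multi-view agreement instead): predictions whose confidence clears a threshold are treated as labels and folded back into training.

```python
import torch

def select_pseudo_labels(model, unlabeled, threshold=0.9):
    """Keep unlabeled samples whose max softmax probability is high.

    unlabeled: (N, D) input batch. Returns (inputs, pseudo_targets)
    for the confident subset, to be mixed into the labeled pool.
    """
    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(unlabeled), dim=-1)  # (N, C) class probs
    conf, pred = probs.max(dim=-1)
    keep = conf >= threshold                             # confident subset only
    return unlabeled[keep], pred[keep]

# Toy usage: a linear classifier stands in for the pose model.
model = torch.nn.Linear(16, 5)
inputs, targets = select_pseudo_labels(model, torch.randn(32, 16), threshold=0.5)
```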
- Learning to Disambiguate Strongly Interacting Hands via Probabilistic Per-pixel Part Segmentation [84.28064034301445]
Self-similarity, and the resulting ambiguities in assigning pixel observations to the respective hands, is a major cause of the final 3D pose error.
We propose DIGIT, a novel method for estimating the 3D poses of two interacting hands from a single monocular image.
We experimentally show that the proposed approach achieves new state-of-the-art performance on the InterHand2.6M dataset.
arXiv Detail & Related papers (2021-07-01T13:28:02Z)
- Self-supervised Human Detection and Segmentation via Multi-view Consensus [116.92405645348185]
We propose a multi-camera framework in which geometric constraints are embedded in the form of multi-view consistency during training.
We show that our approach outperforms state-of-the-art self-supervised person detection and segmentation techniques on images that visually depart from those of standard benchmarks.
arXiv Detail & Related papers (2020-12-09T15:47:21Z)
- Learning View-Disentangled Human Pose Representation by Contrastive Cross-View Mutual Information Maximization [33.36330493757669]
We introduce a novel representation learning method to disentangle pose-dependent as well as view-dependent factors from 2D human poses.
The method trains a network using cross-view mutual information (CV-MIM), which maximizes the mutual information of the same pose performed from different viewpoints.
CV-MIM outperforms other competing methods by a large margin in the single-shot cross-view setting.
arXiv Detail & Related papers (2020-12-02T18:55:35Z)
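As a rough illustration of the cross-view mutual-information objective behind CV-MIM in the entry above, the hypothetical sketch below computes an InfoNCE-style contrastive loss: embeddings of the same pose seen from two viewpoints are pulled together, while other poses in the batch act as negatives. The encoder and temperature are placeholders, not the paper's actual architecture.

```python
import torch
import torch.nn.functional as F

def cross_view_infonce(z_a, z_b, temperature=0.1):
    """InfoNCE loss between two viewpoints of the same batch of poses.

    z_a, z_b: (N, D) embeddings of the same N poses from two viewpoints.
    Row i of z_a and row i of z_b form a positive pair; all other rows
    in the batch serve as negatives.
    """
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature  # (N, N) cosine-similarity logits
    targets = torch.arange(z_a.size(0))   # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage: 8 poses embedded into 32-D from two different viewpoints.
loss = cross_view_infonce(torch.randn(8, 32), torch.randn(8, 32))
```

Maximizing agreement on positives while discriminating against in-batch negatives is a standard lower-bound surrogate for the cross-view mutual information the summary refers to.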