Transformer-based Global 3D Hand Pose Estimation in Two Hands
Manipulating Objects Scenarios
- URL: http://arxiv.org/abs/2210.11384v1
- Date: Thu, 20 Oct 2022 16:24:47 GMT
- Title: Transformer-based Global 3D Hand Pose Estimation in Two Hands
Manipulating Objects Scenarios
- Authors: Hoseong Cho, Donguk Kim, Chanwoo Kim, Seongyeong Lee and Seungryul
Baek
- Abstract summary: This report describes our 1st-place solution to the ECCV 2022 challenge on Human Body, Hands, and Activities (HBHA) from Egocentric and Multi-view Cameras (hand pose estimation).
In this challenge, we aim to estimate global 3D hand poses from an input image in which two hands and an object interact, seen from an egocentric viewpoint.
Our proposed method performs end-to-end multi-hand pose estimation via a transformer architecture.
- Score: 13.59950629234404
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This report describes our 1st-place solution to the ECCV 2022
challenge on Human Body, Hands, and Activities (HBHA) from Egocentric and
Multi-view Cameras (hand pose estimation). In this challenge, we aim to
estimate global 3D hand poses from an input image in which two hands and an
object interact, seen from an egocentric viewpoint. Our proposed method
performs end-to-end multi-hand pose estimation via a transformer
architecture. In particular, our method robustly estimates hand poses in
scenarios where the two hands interact. Additionally, we propose an algorithm
that uses hand scale to robustly estimate absolute depth. The proposed
algorithm works well even when hand sizes vary from person to person. Our
method attains errors of 14.4 mm (left hand) and 15.9 mm (right hand) on the
test set.
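The report gives no implementation details beyond the abstract, but the scale-aware absolute-depth idea maps naturally onto the pinhole camera model: a bone of metric length L that spans s pixels at depth Z satisfies s ≈ f·L/Z, so Z ≈ f·L/s. A minimal Python sketch under that assumption (the function name, MANO-style joint ordering, and median aggregation are illustrative choices, not the authors' method):

```python
import numpy as np

def estimate_absolute_depth(kpts_2d: np.ndarray,
                            kpts_3d_rel: np.ndarray,
                            focal_px: float) -> float:
    """Hypothetical sketch of scale-aware absolute depth recovery.

    Under a pinhole camera, a bone of metric length L spanning s pixels
    at depth Z satisfies s ~ focal_px * L / Z, so Z ~ focal_px * L / s.

    kpts_2d:     (21, 2) image-space keypoints in pixels.
    kpts_3d_rel: (21, 3) root-relative 3D keypoints in millimetres
                 (e.g. a network's metric-scale output).
    focal_px:    camera focal length in pixels.
    Returns the estimated absolute depth of the hand in millimetres.
    """
    # MANO-style topology: wrist = 0, then four joints per finger.
    parent = [0, 0, 1, 2, 3, 0, 5, 6, 7, 0, 9, 10, 11,
              0, 13, 14, 15, 0, 17, 18, 19]
    depths = []
    for child, par in enumerate(parent):
        if child == 0:
            continue  # the wrist has no parent bone
        len_3d = np.linalg.norm(kpts_3d_rel[child] - kpts_3d_rel[par])  # mm
        len_2d = np.linalg.norm(kpts_2d[child] - kpts_2d[par])          # px
        if len_2d > 1e-6:
            depths.append(focal_px * len_3d / len_2d)
    # Median over all bones is robust to a few badly localised joints.
    return float(np.median(depths))
```

Because the ratio uses each person's own predicted bone lengths rather than a fixed canonical hand size, such an estimate adapts when hand sizes vary across people, which is the robustness property the abstract claims.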
Related papers
- SHARP: Segmentation of Hands and Arms by Range using Pseudo-Depth for Enhanced Egocentric 3D Hand Pose Estimation and Action Recognition [5.359837526794863]
Hand pose is key information for action recognition from the egocentric perspective.
We propose to improve egocentric 3D hand pose estimation from RGB frames alone by using pseudo-depth images.
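The summary does not spell out SHARP's pipeline, but the title names the pattern: segment hands and arms by range using pseudo-depth. Concretely, one can predict depth from RGB with a monocular depth estimator and gate out everything beyond the near band where egocentric hands and forearms live. A hedged sketch (depth_model and the range thresholds are placeholders, not the paper's values):

```python
import numpy as np

def mask_hands_by_pseudo_depth(rgb: np.ndarray,
                               depth_model,
                               near_m: float = 0.1,
                               far_m: float = 0.8) -> np.ndarray:
    """Illustrative range-gating sketch, not SHARP's exact pipeline.

    depth_model is any monocular depth estimator mapping an HxWx3 RGB
    image to an HxW depth map in metres; in egocentric video the user's
    hands and forearms occupy the nearest depth band, so a simple range
    gate isolates them.
    """
    pseudo_depth = depth_model(rgb)                   # (H, W) metres
    mask = (pseudo_depth > near_m) & (pseudo_depth < far_m)
    return mask.astype(np.uint8)                      # 1 = near range
```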
arXiv Detail & Related papers (2024-08-19T14:30:29Z)
- In My Perspective, In My Hands: Accurate Egocentric 2D Hand Pose and Action Recognition [1.4732811715354455]
Action recognition is essential for egocentric video understanding, allowing automatic and continuous monitoring of Activities of Daily Living (ADLs) without user effort.
Existing literature focuses on 3D hand pose input, which requires either computationally intensive depth-estimation networks or an uncomfortable wearable depth sensor.
We introduce two novel approaches for 2D hand pose estimation, namely EffHandNet for single-hand estimation and EffHandEgoNet, tailored for an egocentric perspective.
arXiv Detail & Related papers (2024-04-14T17:33:33Z)
- HandDiff: 3D Hand Pose Estimation with Diffusion on Image-Point Cloud [60.47544798202017]
Hand pose estimation is a critical task in various human-computer interaction applications.
This paper proposes HandDiff, a diffusion-based hand pose estimation model that iteratively denoises accurate hand pose conditioned on hand-shaped image-point clouds.
Experimental results demonstrate that the proposed HandDiff significantly outperforms the existing approaches on four challenging hand pose benchmark datasets.
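The summary names the mechanism, iterative denoising conditioned on image-point-cloud features, without details; a generic DDPM-style reverse loop over joint coordinates illustrates the shape of such a sampler. The denoiser network, step count, and noise schedule below are placeholder assumptions, not HandDiff's:

```python
import torch

@torch.no_grad()
def sample_hand_pose(denoiser, cond_feats, num_joints=21, steps=50):
    """Generic conditional diffusion sampler for joint coordinates
    (an illustrative DDPM-style loop, not HandDiff's exact schedule).

    denoiser(x_t, t, cond_feats) predicts the noise added to the clean
    pose; cond_feats stands in for the image/point-cloud condition.
    """
    betas = torch.linspace(1e-4, 0.02, steps)     # linear noise schedule
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    x = torch.randn(1, num_joints, 3)             # start from pure noise
    for t in reversed(range(steps)):
        eps = denoiser(x, torch.tensor([t]), cond_feats)
        # Posterior mean of x_{t-1} given the predicted noise.
        x = (x - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps) \
            / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x                                      # (1, num_joints, 3) pose
```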
arXiv Detail & Related papers (2024-04-04T02:15:16Z)
- Deformer: Dynamic Fusion Transformer for Robust Hand Pose Estimation [59.3035531612715]
Existing methods often struggle to generate plausible hand poses when the hand is heavily occluded or blurred.
In videos, the movements of the hand allow us to observe various parts of the hand that may be occluded or blurred in a single frame.
We propose the Deformer: a framework that implicitly reasons about the relationship between hand parts within the same image.
arXiv Detail & Related papers (2023-03-09T02:24:30Z)
- LG-Hand: Advancing 3D Hand Pose Estimation with Locally and Globally Kinematic Knowledge [0.693939291118954]
We propose LG-Hand, a powerful method for 3D hand pose estimation.
We argue that kinematic information plays an important role, contributing to the performance of 3D hand pose estimation.
Our method achieves promising results on the First-Person Hand Action Benchmark dataset.
arXiv Detail & Related papers (2022-11-06T15:26:32Z)
- 3D Interacting Hand Pose Estimation by Hand De-occlusion and Removal [85.30756038989057]
Estimating 3D interacting hand pose from a single RGB image is essential for understanding human actions.
We propose to decompose the challenging interacting hand pose estimation task and estimate the pose of each hand separately.
Experiments show that the proposed method significantly outperforms previous state-of-the-art interacting hand pose estimation approaches.
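The decomposition can be pictured as turning one hard two-hand problem into two easier single-hand problems: localise each hand, suppress (e.g. inpaint) the distracting other hand, and run a single-hand estimator on the cleaned crop. All helpers in this sketch are hypothetical stand-ins, not the paper's modules:

```python
def estimate_interacting_hands(img, detect_hands, remove_other_hand,
                               single_hand_pose):
    """Illustrative decomposition in the spirit of de-occlusion and
    removal; img is an HxWx3 array and every helper is a placeholder.

    detect_hands(img)           -> {'left': box, 'right': box}
    remove_other_hand(img, box) -> image with the distracting hand
                                   inpainted away inside `box`
    single_hand_pose(crop)      -> (21, 3) pose for one hand
    """
    poses = {}
    for side, box in detect_hands(img).items():
        x0, y0, x1, y1 = box
        # Clean the crop so the single-hand estimator sees one hand only.
        crop = remove_other_hand(img, box)[y0:y1, x0:x1]
        poses[side] = single_hand_pose(crop)
    return poses
```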
arXiv Detail & Related papers (2022-07-22T13:04:06Z)
- Learning to Disambiguate Strongly Interacting Hands via Probabilistic Per-pixel Part Segmentation [84.28064034301445]
Self-similarity, and the resulting ambiguity in assigning pixel observations to the respective hands, is a major cause of the final 3D pose error.
We propose DIGIT, a novel method for estimating the 3D poses of two interacting hands from a single monocular image.
We experimentally show that the proposed approach achieves new state-of-the-art performance on the InterHand2.6M dataset.
arXiv Detail & Related papers (2021-07-01T13:28:02Z)
- HandsFormer: Keypoint Transformer for Monocular 3D Pose Estimation of Hands and Object in Interaction [33.661745138578596]
We propose a robust and accurate method for estimating the 3D poses of two hands in close interaction from a single color image.
Our method starts by extracting a set of potential 2D locations for the joints of both hands as extrema of a heatmap.
We use appearance and spatial encodings of these locations as input to a transformer, and leverage the attention mechanisms to sort out the correct configuration of the joints.
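Extracting joint candidates as heatmap extrema is a standard operation and easy to make concrete: keep every pixel that is both a local maximum and above a confidence threshold. A small sketch (window size and threshold are illustrative, not the paper's settings):

```python
import numpy as np
from scipy.ndimage import maximum_filter

def heatmap_peaks(heatmap: np.ndarray, thresh: float = 0.3,
                  window: int = 5) -> np.ndarray:
    """Extract candidate joint locations as local maxima of a heatmap.

    A pixel is a peak if it equals the maximum of its window x window
    neighbourhood and exceeds the confidence threshold. Returns an
    (N, 2) array of (row, col) peak coordinates.
    """
    local_max = maximum_filter(heatmap, size=window) == heatmap
    return np.argwhere(local_max & (heatmap > thresh))
```

Each candidate, paired with appearance and spatial encodings, would then become a token for the transformer that resolves the correct joint configuration, as the summary describes.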
arXiv Detail & Related papers (2021-04-29T20:19:20Z)
- H2O: Two Hands Manipulating Objects for First Person Interaction Recognition [70.46638409156772]
We present a comprehensive framework for egocentric interaction recognition using markerless 3D annotations of two hands manipulating objects.
Our method produces annotations of the 3D pose of two hands and the 6D pose of the manipulated objects, along with their interaction labels for each frame.
Our dataset, called H2O (2 Hands and Objects), provides synchronized multi-view RGB-D images, interaction labels, object classes, ground-truth 3D poses for left & right hands, 6D object poses, ground-truth camera poses, object meshes and scene point clouds.
arXiv Detail & Related papers (2021-04-22T17:10:42Z)
- Measuring Generalisation to Unseen Viewpoints, Articulations, Shapes and Objects for 3D Hand Pose Estimation under Hand-Object Interaction [137.28465645405655]
HANDS'19 is a challenge to evaluate the abilities of current 3D hand pose estimators (HPEs) to interpolate and extrapolate the poses of a training set.
We show that the accuracy of state-of-the-art methods can drop, and that they fail mostly on poses absent from the training set.
arXiv Detail & Related papers (2020-03-30T19:28:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.