A2J-Transformer: Anchor-to-Joint Transformer Network for 3D Interacting
Hand Pose Estimation from a Single RGB Image
- URL: http://arxiv.org/abs/2304.03635v1
- Date: Fri, 7 Apr 2023 13:30:36 GMT
- Title: A2J-Transformer: Anchor-to-Joint Transformer Network for 3D Interacting
Hand Pose Estimation from a Single RGB Image
- Authors: Changlong Jiang, Yang Xiao, Cunlin Wu, Mingyang Zhang, Jinghong Zheng,
Zhiguo Cao, and Joey Tianyi Zhou
- Abstract summary: We propose to extend A2J, the state-of-the-art depth-based 3D single-hand pose estimation method, to the RGB domain under the interacting-hand condition.
A2J is evolved under the Transformer's non-local encoding-decoding framework to build A2J-Transformer.
Experiments on the challenging InterHand 2.6M dataset demonstrate that A2J-Transformer achieves state-of-the-art model-free performance.
- Score: 46.5947382684857
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D interacting hand pose estimation from a single RGB image is a challenging task, due to severe self-occlusion and inter-occlusion of the hands, confusingly similar appearance patterns between the two hands, the ill-posed mapping of joint positions from 2D to 3D, and so on. To address these issues, we propose to extend A2J, the state-of-the-art depth-based 3D single-hand pose estimation method, to the RGB domain under the interacting-hand condition. Our key idea is to equip A2J with strong local-global awareness, so that it jointly captures the interacting hands' local fine details and the global articulation clues among joints. To this end, A2J is evolved under the Transformer's non-local encoding-decoding framework to build A2J-Transformer. It holds three main advantages over A2J. First, self-attention across local anchor points makes them aware of the global spatial context, so they better capture joints' articulation clues and resist occlusion. Second, each anchor point is treated as a learnable query with adaptive feature learning, which improves pattern-fitting capacity, instead of sharing the same local representation with the other anchors. Last but not least, anchor points are located in 3D space rather than 2D as in A2J, which benefits 3D pose prediction. Experiments on the challenging InterHand 2.6M dataset demonstrate that A2J-Transformer achieves state-of-the-art model-free performance (a 3.38 mm MPJPE improvement in the two-hand case) and can also be applied to the depth domain with strong generalization.
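The mechanism described in the abstract reads as follows: 3D anchor points serve as learnable queries in a Transformer decoder, attend to each other and to the image features, and each anchor then predicts per-joint offsets together with a confidence weight; the final joint position is the weight-normalized aggregation of anchor-plus-offset estimates. Below is a minimal PyTorch sketch of that anchor-to-joint aggregation; the module name, tensor shapes, layer choices, and hyperparameters (number of anchors, joints, feature dimension) are illustrative assumptions rather than the authors' released implementation.

```python
# Minimal sketch of anchor-to-joint regression with learnable 3D anchor queries.
# Shapes, layer choices, and hyperparameters are assumptions for illustration,
# not the official A2J-Transformer code.
import torch
import torch.nn as nn

class AnchorToJointHead(nn.Module):
    def __init__(self, num_anchors=256, num_joints=42, d_model=256, num_layers=3):
        super().__init__()
        # Learnable 3D anchor positions (x, y, z) and per-anchor query embeddings.
        self.anchor_xyz = nn.Parameter(torch.rand(num_anchors, 3))
        self.anchor_query = nn.Parameter(torch.randn(num_anchors, d_model))
        # Transformer decoder: self-attention among anchors plus cross-attention
        # to backbone features gives every anchor global spatial context.
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        # Each anchor predicts an offset to every joint and a confidence weight.
        self.offset_head = nn.Linear(d_model, num_joints * 3)
        self.weight_head = nn.Linear(d_model, num_joints)
        self.num_joints = num_joints

    def forward(self, img_feats):
        # img_feats: (B, N_tokens, d_model) flattened backbone feature tokens.
        B = img_feats.shape[0]
        queries = self.anchor_query.unsqueeze(0).expand(B, -1, -1)
        dec = self.decoder(queries, img_feats)             # (B, A, d_model)
        offsets = self.offset_head(dec).view(B, -1, self.num_joints, 3)
        weights = self.weight_head(dec).softmax(dim=1)     # normalize over anchors
        # Joint estimate: weighted sum over anchors of (anchor position + offset).
        anchors = self.anchor_xyz.view(1, -1, 1, 3)
        joints = (weights.unsqueeze(-1) * (anchors + offsets)).sum(dim=1)
        return joints                                      # (B, num_joints, 3)
```

The softmax over the anchor dimension turns each joint estimate into a weighted ensemble over all anchors, which is the part of the design that lends the prediction global, occlusion-robust context rather than relying on a single local anchor.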
Related papers
- Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding [83.63231467746598]
We introduce Any2Point, a parameter-efficient method to empower any-modality large models (vision, language, audio) for 3D understanding.
We propose a 3D-to-any (1D or 2D) virtual projection strategy that correlates the input 3D points to the original 1D or 2D positions within the source modality.
arXiv Detail & Related papers (2024-04-11T17:59:45Z)
- A Single 2D Pose with Context is Worth Hundreds for 3D Human Pose Estimation [18.72362803593654]
The dominant paradigm in 3D human pose estimation that lifts a 2D pose sequence to 3D heavily relies on long-term temporal clues.
This can be attributed to their inherent inability to perceive spatial context as plain 2D joint coordinates carry no visual cues.
We propose a straightforward yet powerful solution: leveraging the readily available intermediate visual representations produced by off-the-shelf (pre-trained) 2D pose detectors.
arXiv Detail & Related papers (2023-11-06T18:04:13Z)
- Joint-MAE: 2D-3D Joint Masked Autoencoders for 3D Point Cloud Pre-training [65.75399500494343]
Masked Autoencoders (MAE) have shown promising performance in self-supervised learning for 2D and 3D computer vision.
We propose Joint-MAE, a 2D-3D joint MAE framework for self-supervised 3D point cloud pre-training.
arXiv Detail & Related papers (2023-02-27T17:56:18Z)
- Decoupled Iterative Refinement Framework for Interacting Hands Reconstruction from a Single RGB Image [30.24438569170251]
We propose a decoupled iterative refinement framework to achieve pixel-aligned hand reconstruction.
Our method outperforms all existing two-hand reconstruction methods by a large margin on the InterHand2.6M dataset.
arXiv Detail & Related papers (2023-02-05T15:46:57Z)
- Asymmetric 3D Context Fusion for Universal Lesion Detection [55.61873234187917]
3D networks are strong in 3D context yet lack supervised pretraining.
Existing 3D context fusion operators are designed to be spatially symmetric, performing identical operations on each 2D slice like convolutions.
We propose a novel asymmetric 3D context fusion operator (A3D), which uses different weights to fuse 3D context from different 2D slices.
arXiv Detail & Related papers (2021-09-17T16:25:10Z)
- RGB2Hands: Real-Time Tracking of 3D Hand Interactions from Monocular RGB Video [76.86512780916827]
We present the first real-time method for motion capture of skeletal pose and 3D surface geometry of hands from a single RGB camera.
In order to address the inherent depth ambiguities in RGB data, we propose a novel multi-task CNN.
We experimentally verify the individual components of our RGB two-hand tracking and 3D reconstruction pipeline.
arXiv Detail & Related papers (2021-06-22T12:53:56Z)
- A hybrid classification-regression approach for 3D hand pose estimation using graph convolutional networks [1.0152838128195467]
We propose a two-stage GCN-based framework that learns per-pose relationship constraints.
The first phase quantizes the 2D/3D space to classify the joints into 2D/3D blocks based on their locality.
The second stage uses a GCN-based module with an adaptive nearest-neighbor algorithm to determine joint relationships.
arXiv Detail & Related papers (2021-05-23T10:09:10Z)
- HandsFormer: Keypoint Transformer for Monocular 3D Pose Estimation of Hands and Object in Interaction [33.661745138578596]
We propose a robust and accurate method for estimating the 3D poses of two hands in close interaction from a single color image.
Our method starts by extracting a set of potential 2D locations for the joints of both hands as extrema of a heatmap (a generic version of this extraction step is sketched after this list).
We use appearance and spatial encodings of these locations as input to a transformer, and leverage the attention mechanisms to sort out the correct configuration of the joints.
arXiv Detail & Related papers (2021-04-29T20:19:20Z)
- Bidirectional Projection Network for Cross Dimension Scene Understanding [69.29443390126805]
We present a bidirectional projection network (BPNet) for joint 2D and 3D reasoning in an end-to-end manner.
Via the BPM, complementary 2D and 3D information can interact with each other in multiple architectural levels.
Our BPNet achieves top performance on the ScanNetV2 benchmark for both 2D and 3D semantic segmentation.
arXiv Detail & Related papers (2021-03-26T08:31:39Z)
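The HandsFormer entry above references a sketch of its first step: extracting candidate 2D joint locations as extrema of a heatmap. A common, generic way to implement that step is local-maximum suppression via max pooling, sketched below in PyTorch; the function name, threshold, and window size are illustrative assumptions and are not taken from the HandsFormer paper.

```python
# Generic sketch of extracting candidate 2D keypoint locations as heatmap
# extrema (local maxima). Names and defaults are illustrative assumptions,
# not the HandsFormer implementation.
import torch
import torch.nn.functional as F

def heatmap_extrema(heatmap, threshold=0.3, window=3):
    """heatmap: (B, K, H, W) per-joint score maps in [0, 1].
    Returns, per batch element, a list of (joint_id, x, y, score) candidates."""
    pad = window // 2
    # A pixel is a local maximum if it equals the max of its neighborhood.
    pooled = F.max_pool2d(heatmap, kernel_size=window, stride=1, padding=pad)
    is_peak = (heatmap == pooled) & (heatmap > threshold)
    candidates = []
    for b in range(heatmap.shape[0]):
        k, y, x = torch.nonzero(is_peak[b], as_tuple=True)
        scores = heatmap[b, k, y, x]
        candidates.append(
            [(int(k_), int(x_), int(y_), float(s_))
             for k_, x_, y_, s_ in zip(k, x, y, scores)]
        )
    return candidates
```

The resulting candidates, combined with appearance and spatial encodings sampled at their locations, would then be passed to the transformer that resolves the correct joint configuration, as the entry describes.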