THOR-Net: End-to-end Graformer-based Realistic Two Hands and Object
Reconstruction with Self-supervision
- URL: http://arxiv.org/abs/2210.13853v1
- Date: Tue, 25 Oct 2022 09:18:50 GMT
- Title: THOR-Net: End-to-end Graformer-based Realistic Two Hands and Object
Reconstruction with Self-supervision
- Authors: Ahmed Tawfik Aboukhadra, Jameel Malik, Ahmed Elhayek, Nadia Robertini
and Didier Stricker
- Abstract summary: THOR-Net combines the power of GCNs, Transformer, and self-supervision to reconstruct two hands and an object from a single RGB image.
Our approach achieves state-of-the-art results in hand shape estimation on the HO-3D dataset (10.0 mm).
It also surpasses other methods in hand pose estimation on the challenging two hands and object (H2O) dataset by 5 mm on the left-hand pose and 1 mm on the right-hand pose.
- Score: 11.653985098433841
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Realistic reconstruction of two hands interacting with objects is a new and
challenging problem that is essential for building personalized Virtual and
Augmented Reality environments. Graph Convolutional Networks (GCNs) preserve the
topology of hand poses and shapes by modeling them as graphs. In this work, we
propose THOR-Net, which combines the power of GCNs, Transformers, and
self-supervision to realistically reconstruct two hands and an object from a
single RGB image. Our network comprises two stages: a feature extraction stage
and a reconstruction stage. In the feature extraction stage, a Keypoint RCNN is
used to extract 2D poses, feature maps, heatmaps, and bounding boxes from a
monocular RGB image. This 2D information is then modeled as two graphs and passed
to the two branches of the reconstruction stage. The shape reconstruction branch
estimates the meshes of the two hands and the object using our novel
coarse-to-fine GraFormer shape network, while the other branch reconstructs the
3D poses of the hands and the object using a GraFormer network. Finally, a
self-supervised photometric loss is used to directly regress a realistic texture
for each vertex of the hand meshes.
Our approach achieves state-of-the-art results in hand shape estimation on the
HO-3D dataset (10.0 mm), exceeding ArtiBoost (10.8 mm). It also surpasses other
methods in hand pose estimation on the challenging two hands and object (H2O)
dataset by 5 mm on the left-hand pose and 1 mm on the right-hand pose.
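As a rough illustration of the feature extraction stage, the sketch below runs an off-the-shelf Keypoint R-CNN from torchvision on a single RGB image to obtain 2D keypoints, bounding boxes, and backbone feature maps. The pretrained COCO model, the image path, and the way the feature maps are pulled from the backbone are assumptions for illustration, not the exact detector THOR-Net trains on hand/object data.

```python
# Minimal sketch of a Keypoint R-CNN feature extraction step (assumed
# stand-in for THOR-Net's first stage; requires a recent torchvision).
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.keypointrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

img = to_tensor(Image.open("rgb_frame.jpg").convert("RGB"))  # hypothetical input image

with torch.no_grad():
    # Detection outputs: per-instance boxes, labels, scores, and 2D keypoints.
    detections = model([img])[0]
    # Backbone (FPN) feature maps, the kind of 2D features a THOR-Net-style
    # pipeline would pass on to the graph-based reconstruction stage.
    features = model.backbone(img.unsqueeze(0))

boxes = detections["boxes"]          # (N, 4) bounding boxes
keypoints = detections["keypoints"]  # (N, K, 3) 2D keypoints with visibility
print(boxes.shape, keypoints.shape, list(features.keys()))
```

The reconstruction branches are built from GraFormer blocks that interleave graph convolution over a fixed skeleton or mesh adjacency with multi-head self-attention over the node features. Below is a minimal sketch of one such block, assuming a simple normalized-adjacency graph convolution and standard PyTorch attention; the real GraFormer uses Chebyshev graph convolutions and a specific layer layout, so this is only a structural illustration.

```python
import torch
import torch.nn as nn

class GraFormerStyleBlock(nn.Module):
    """Sketch of a GraFormer-style block: graph convolution over a fixed
    adjacency, followed by multi-head self-attention."""

    def __init__(self, dim: int, adj: torch.Tensor, heads: int = 4):
        super().__init__()
        # Symmetrically normalized adjacency with self-loops (assumption;
        # the original GraFormer uses Chebyshev graph convolutions).
        a = adj + torch.eye(adj.size(0))
        d = a.sum(dim=1).rsqrt()
        self.register_buffer("adj_norm", d[:, None] * a * d[None, :])
        self.gcn_weight = nn.Linear(dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_nodes, dim) node features (e.g. lifted 2D keypoints).
        x = self.norm1(x + self.adj_norm @ self.gcn_weight(x))
        attn_out, _ = self.attn(x, x, x)
        return self.norm2(x + attn_out)

# Usage: 21 hand joints with a hypothetical partial kinematic-chain adjacency.
adj = torch.zeros(21, 21)
for parent, child in [(0, 1), (1, 2), (2, 3), (3, 4)]:  # thumb chain, for illustration
    adj[parent, child] = adj[child, parent] = 1.0
block = GraFormerStyleBlock(dim=64, adj=adj)
out = block(torch.randn(2, 21, 64))
print(out.shape)  # torch.Size([2, 21, 64])
```

The self-supervised texture term compares predicted per-vertex colors against the input image. The sketch below captures only the core of that idea, assuming known camera intrinsics, a pinhole projection, and bilinear sampling of the image at the projected vertex locations; the paper supervises a rendered textured mesh, so this is not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def photometric_vertex_loss(verts_3d, vert_colors, image, K):
    """Sketch of a self-supervised photometric loss.

    verts_3d:    (V, 3) predicted mesh vertices in camera coordinates.
    vert_colors: (V, 3) predicted RGB texture per vertex, in [0, 1].
    image:       (3, H, W) input RGB image, in [0, 1].
    K:           (3, 3) camera intrinsics (assumed known).
    """
    _, h, w = image.shape
    # Pinhole projection of each vertex into pixel coordinates.
    proj = verts_3d @ K.T
    pix = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)           # (V, 2) in pixels
    # Normalize to [-1, 1] for grid_sample.
    grid = torch.stack([pix[:, 0] / (w - 1), pix[:, 1] / (h - 1)], dim=-1) * 2 - 1
    sampled = F.grid_sample(
        image[None], grid[None, None], align_corners=True
    )[0, :, 0].T                                               # (V, 3) image colors
    # L1 photometric difference (occlusion/visibility handling omitted).
    return (vert_colors - sampled).abs().mean()

# Usage with random tensors, just to show the shapes involved.
loss = photometric_vertex_loss(
    torch.rand(778, 3) * torch.tensor([0.2, 0.2, 1.0])
    + torch.tensor([-0.1, -0.1, 0.5]),                # MANO hand mesh: 778 vertices
    torch.rand(778, 3),
    torch.rand(3, 256, 256),
    torch.tensor([[200.0, 0.0, 128.0], [0.0, 200.0, 128.0], [0.0, 0.0, 1.0]]),
)
print(loss.item())
```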
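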
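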
Related papers
- Reconstructing Hand-Held Objects in 3D [53.277402172488735]
We present a paradigm for handheld object reconstruction that builds on recent breakthroughs in large language/vision models and 3D object datasets.
We use GPT-4(V) to retrieve a 3D object model that matches the object in the image and rigidly align the model to the network-inferred geometry.
Experiments demonstrate that MCC-HO achieves state-of-the-art performance on lab and Internet datasets.
arXiv Detail & Related papers (2024-04-09T17:55:41Z) - In-Hand 3D Object Reconstruction from a Monocular RGB Video [17.31419675163019]
Our work aims to reconstruct a 3D object that is held and rotated by a hand in front of a static RGB camera.
Previous methods that use implicit neural representations to recover the geometry of a generic hand-held object from multi-view images achieved compelling results in the visible part of the object.
arXiv Detail & Related papers (2023-12-27T06:19:25Z) - HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and
Objects from Video [70.11702620562889]
HOLD is the first category-agnostic method that reconstructs an articulated hand and an object jointly from a monocular interaction video.
We develop a compositional articulated implicit model that can disentangle the 3D hand and object from 2D images.
Our method does not rely on 3D hand-object annotations while outperforming fully-supervised baselines in both in-the-lab and challenging in-the-wild settings.
arXiv Detail & Related papers (2023-11-30T10:50:35Z) - ShapeGraFormer: GraFormer-Based Network for Hand-Object Reconstruction from a Single Depth Map [11.874184782686532]
We propose the first approach for realistic 3D hand-object shape and pose reconstruction from a single depth map.
Our pipeline additionally predicts voxelized hand-object shapes, having a one-to-one mapping to the input voxelized depth.
In addition, we show the impact of adding another GraFormer component that refines the reconstructed shapes based on the hand-object interactions.
arXiv Detail & Related papers (2023-10-18T09:05:57Z) - SHOWMe: Benchmarking Object-agnostic Hand-Object 3D Reconstruction [13.417086460511696]
We introduce the SHOWMe dataset which consists of 96 videos, annotated with real and detailed hand-object 3D textured meshes.
We consider a rigid hand-object scenario, in which the pose of the hand with respect to the object remains constant during the whole video sequence.
This assumption allows us to register sub-millimetre-precise groundtruth 3D scans to the image sequences in SHOWMe.
arXiv Detail & Related papers (2023-09-19T16:48:29Z) - HandNeRF: Learning to Reconstruct Hand-Object Interaction Scene from a Single RGB Image [41.580285338167315]
This paper presents a method to learn hand-object interaction prior for reconstructing a 3D hand-object scene from a single RGB image.
We use the hand shape to constrain the possible relative configuration of the hand and object geometry.
We show that HandNeRF is able to reconstruct hand-object scenes of novel grasp configurations more accurately than comparable methods.
arXiv Detail & Related papers (2023-09-14T17:42:08Z) - Consistent 3D Hand Reconstruction in Video via self-supervised Learning [67.55449194046996]
We present a method for reconstructing accurate and consistent 3D hands from a monocular video.
The detected 2D hand keypoints and the image texture provide important cues about the geometry and texture of the 3D hand.
We propose S2HAND, a self-supervised 3D hand reconstruction model.
arXiv Detail & Related papers (2022-01-24T09:44:11Z) - Model-based 3D Hand Reconstruction via Self-Supervised Learning [72.0817813032385]
Reconstructing a 3D hand from a single-view RGB image is challenging due to various hand configurations and depth ambiguity.
We propose S2HAND, a self-supervised 3D hand reconstruction network that can jointly estimate pose, shape, texture, and the camera viewpoint.
For the first time, we demonstrate the feasibility of training an accurate 3D hand reconstruction network without relying on manual annotations.
arXiv Detail & Related papers (2021-03-22T10:12:43Z) - Towards Realistic 3D Embedding via View Alignment [53.89445873577063]
This paper presents an innovative View Alignment GAN (VA-GAN) that composes new images by embedding 3D models into 2D background images realistically and automatically.
VA-GAN consists of a texture generator and a differential discriminator that are inter-connected and end-to-end trainable.
arXiv Detail & Related papers (2020-07-14T14:45:00Z) - HandVoxNet: Deep Voxel-Based Network for 3D Hand Shape and Pose
Estimation from a Single Depth Map [72.93634777578336]
We propose a novel architecture with 3D convolutions trained in a weakly-supervised manner.
The proposed approach improves over the state of the art by 47.8% on the SynHand5M dataset.
Our method produces visually more reasonable and realistic hand shapes on NYU and BigHand2.2M datasets.
arXiv Detail & Related papers (2020-04-03T14:27:16Z)