HO-Cap: A Capture System and Dataset for 3D Reconstruction and Pose Tracking of Hand-Object Interaction
- URL: http://arxiv.org/abs/2406.06843v2
- Date: Sun, 16 Jun 2024 20:51:53 GMT
- Title: HO-Cap: A Capture System and Dataset for 3D Reconstruction and Pose Tracking of Hand-Object Interaction
- Authors: Jikai Wang, Qifan Zhang, Yu-Wei Chao, Bowen Wen, Xiaohu Guo, Yu Xiang
- Abstract summary: We introduce a new dataset named HO-Cap that can be used to study 3D reconstruction and pose tracking of hands and objects in videos.
We propose a semi-automatic method to obtain annotations of shape and pose of hands and objects in the collected videos.
Our data capture setup and annotation framework can be used by the community to reconstruct 3D shapes of objects and human hands and track their poses in videos.
- Score: 16.363878619678367
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a data capture system and a new dataset named HO-Cap that can be used to study 3D reconstruction and pose tracking of hands and objects in videos. The capture system uses multiple RGB-D cameras and a HoloLens headset for data collection, avoiding the use of expensive 3D scanners or mocap systems. We propose a semi-automatic method to obtain annotations of shape and pose of hands and objects in the collected videos, which significantly reduces the required annotation time compared to manual labeling. With this system, we captured a video dataset of humans using objects to perform different tasks, as well as simple pick-and-place and handover of an object from one hand to the other, which can be used as human demonstrations for embodied AI and robot manipulation research. Our data capture setup and annotation framework can be used by the community to reconstruct 3D shapes of objects and human hands and track their poses in videos.
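To make the multi-camera annotation setup concrete, the sketch below is a minimal illustration (not the authors' code; the camera poses and intrinsics are made-up values) of how a 6D object pose annotated in one RGB-D camera's frame can be expressed in a shared world frame via calibrated extrinsics and reprojected into a second camera's image, which is the kind of cross-view consistency such a capture rig provides.

```python
# Hypothetical illustration of multi-camera pose annotation (not the authors' code).
import numpy as np

def to_homogeneous(R, t):
    """Build a 4x4 rigid transform from a 3x3 rotation and a translation vector."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Assumed calibration: world-from-camera extrinsics for two cameras (made-up values).
T_world_cam0 = to_homogeneous(np.eye(3), np.array([0.0, 0.0, 0.0]))
T_world_cam1 = to_homogeneous(np.eye(3), np.array([0.2, 0.0, 0.0]))

# Object pose annotated in camera 0's frame: identity rotation, 0.8 m in front of the lens.
T_cam0_obj = to_homogeneous(np.eye(3), np.array([0.0, 0.0, 0.8]))

# Express the same pose in the world frame, then in camera 1's frame.
T_world_obj = T_world_cam0 @ T_cam0_obj
T_cam1_obj = np.linalg.inv(T_world_cam1) @ T_world_obj

# Reproject the object origin into camera 1 with a pinhole intrinsic matrix K.
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
p = T_cam1_obj[:3, 3]              # object origin in camera 1 coordinates
uv = (K @ p) / p[2]                # perspective division
print("object origin seen by camera 1 at pixel:", uv[:2])
```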
Related papers
- Introducing HOT3D: An Egocentric Dataset for 3D Hand and Object Tracking [7.443420525809604]
We introduce HOT3D, a dataset for egocentric hand and object tracking in 3D.
The dataset offers over 833 minutes of multi-view RGB/monochrome image streams showing 19 subjects interacting with 33 diverse rigid objects.
In addition to simple pick-up/observe/put-down actions, HOT3D contains scenarios resembling typical actions in a kitchen, office, and living room environment.
arXiv Detail & Related papers (2024-06-13T21:38:17Z) - HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video [70.11702620562889]
HOLD is the first category-agnostic method that reconstructs an articulated hand and object jointly from a monocular interaction video.
We develop a compositional articulated implicit model that can disentangle the 3D hand and object from 2D images.
Our method does not rely on 3D hand-object annotations while outperforming fully-supervised baselines in both in-the-lab and challenging in-the-wild settings.
arXiv Detail & Related papers (2023-11-30T10:50:35Z) - SHOWMe: Benchmarking Object-agnostic Hand-Object 3D Reconstruction [13.417086460511696]
We introduce the SHOWMe dataset which consists of 96 videos, annotated with real and detailed hand-object 3D textured meshes.
We consider a rigid hand-object scenario, in which the pose of the hand with respect to the object remains constant during the whole video sequence.
This assumption allows us to register sub-millimetre-precise ground-truth 3D scans to the image sequences in SHOWMe.
arXiv Detail & Related papers (2023-09-19T16:48:29Z) - Consistent 3D Hand Reconstruction in Video via self-supervised Learning [67.55449194046996]
We present a method for reconstructing accurate and consistent 3D hands from a monocular video.
Detected 2D hand keypoints and the image texture provide important cues about the geometry and texture of the 3D hand (a minimal reprojection sketch appears after this list).
We propose S2HAND, a self-supervised 3D hand reconstruction model.
arXiv Detail & Related papers (2022-01-24T09:44:11Z) - Multi-View Video-Based 3D Hand Pose Estimation [11.65577683784217]
We present the Multi-View Video-Based 3D Hand dataset, consisting of multi-view videos of the hand along with ground-truth 3D pose labels.
Our dataset includes more than 402,000 synthetic hand images available in 4,560 videos.
We also implement MuViHandNet, a neural pipeline consisting of image encoders for obtaining visual embeddings of the hand.
arXiv Detail & Related papers (2021-09-24T05:20:41Z) - D3D-HOI: Dynamic 3D Human-Object Interactions from Videos [49.38319295373466]
We introduce D3D-HOI: a dataset of monocular videos with ground truth annotations of 3D object pose, shape and part motion during human-object interactions.
Our dataset consists of several common articulated objects captured from diverse real-world scenes and camera viewpoints.
We leverage the estimated 3D human pose for more accurate inference of the object spatial layout and dynamics.
arXiv Detail & Related papers (2021-08-19T00:49:01Z) - Towards unconstrained joint hand-object reconstruction from RGB videos [81.97694449736414]
Reconstructing hand-object manipulations holds great potential for robotics and learning from human demonstrations.
We first propose a learning-free fitting approach for hand-object reconstruction which can seamlessly handle two-hand object interactions.
arXiv Detail & Related papers (2021-08-16T12:26:34Z) - H2O: Two Hands Manipulating Objects for First Person Interaction Recognition [70.46638409156772]
We present a comprehensive framework for egocentric interaction recognition using markerless 3D annotations of two hands manipulating objects.
Our method produces annotations of the 3D pose of two hands and the 6D pose of the manipulated objects, along with their interaction labels for each frame.
Our dataset, called H2O (2 Hands and Objects), provides synchronized multi-view RGB-D images, interaction labels, object classes, ground-truth 3D poses for left & right hands, 6D object poses, ground-truth camera poses, object meshes and scene point clouds.
arXiv Detail & Related papers (2021-04-22T17:10:42Z) - GRAB: A Dataset of Whole-Body Human Grasping of Objects [53.00728704389501]
Training computers to understand human grasping requires a rich dataset containing complex 3D object shapes, detailed contact information, hand pose and shape, and the 3D body motion over time.
We collect a new dataset, called GRAB, of whole-body grasps, containing full 3D shape and pose sequences of 10 subjects interacting with 51 everyday objects of varying shape and size.
This is a unique dataset that goes well beyond existing ones for modeling and understanding how humans grasp and manipulate objects, how their full body is involved, and how interaction varies with the task.
arXiv Detail & Related papers (2020-08-25T17:57:55Z)
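As a side note on the self-supervised cue mentioned in the S2HAND entry above, the snippet below is a minimal sketch (my illustration with made-up intrinsics and joint values, not the paper's actual loss) of a 2D keypoint reprojection error: predicted 3D hand joints are projected with a pinhole camera and compared against detected 2D keypoints.

```python
# Minimal sketch of a 2D keypoint reprojection error (illustrative, not S2HAND's implementation).
import numpy as np

def reprojection_error(joints_3d, keypoints_2d, K):
    """Mean L2 distance (pixels) between projected 3D joints and detected 2D keypoints.

    joints_3d:    (N, 3) joints in camera coordinates (z > 0)
    keypoints_2d: (N, 2) detected keypoints in pixels
    K:            (3, 3) pinhole intrinsic matrix
    """
    proj = (K @ joints_3d.T).T            # (N, 3) homogeneous pixel coordinates
    proj = proj[:, :2] / proj[:, 2:3]     # perspective division
    return np.linalg.norm(proj - keypoints_2d, axis=1).mean()

# Toy example with hypothetical values.
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
joints_3d = np.array([[0.00, 0.00, 0.50],
                      [0.02, 0.01, 0.52]])
keypoints_2d = np.array([[320.0, 240.0],
                         [343.0, 251.0]])
print("mean reprojection error (px):", reprojection_error(joints_3d, keypoints_2d, K))
```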