ARCTIC: A Dataset for Dexterous Bimanual Hand-Object Manipulation
- URL: http://arxiv.org/abs/2204.13662v3
- Date: Sun, 23 Apr 2023 13:11:57 GMT
- Title: ARCTIC: A Dataset for Dexterous Bimanual Hand-Object Manipulation
- Authors: Zicong Fan, Omid Taheri, Dimitrios Tzionas, Muhammed Kocabas, Manuel
Kaufmann, Michael J. Black, and Otmar Hilliges
- Abstract summary: ARCTIC is a dataset of two hands that dexterously manipulate objects.
It contains 2.1M video frames paired with accurate 3D hand and object meshes and detailed, dynamic contact information.
- Score: 68.80339307258835
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans intuitively understand that inanimate objects do not move by
themselves, but that state changes are typically caused by human manipulation
(e.g., the opening of a book). This is not yet the case for machines. In part
this is because there exist no datasets with ground-truth 3D annotations for
the study of physically consistent and synchronised motion of hands and
articulated objects. To this end, we introduce ARCTIC -- a dataset of two hands
that dexterously manipulate objects, containing 2.1M video frames paired with
accurate 3D hand and object meshes and detailed, dynamic contact information.
It contains bi-manual articulation of objects such as scissors or laptops,
where hand poses and object states evolve jointly in time. We propose two novel
articulated hand-object interaction tasks: (1) Consistent motion
reconstruction: Given a monocular video, the goal is to reconstruct two hands
and articulated objects in 3D, so that their motions are spatio-temporally
consistent. (2) Interaction field estimation: Dense relative hand-object
distances must be estimated from images. We introduce two baselines, ArcticNet
and InterField, for these two tasks, respectively, and evaluate them qualitatively and quantitatively
on ARCTIC. Our code and data are available at https://arctic.is.tue.mpg.de.
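As a rough illustration of the second task, the sketch below (plain NumPy with hypothetical helper names, not the released ARCTIC/InterField code) computes a dense interaction field from ground-truth meshes: for each hand vertex, the distance to the closest object vertex, and vice versa. A regressor in the spirit of InterField would be trained to predict such per-vertex distances directly from an image.

```python
import numpy as np

def interaction_field(hand_verts, obj_verts):
    """Dense relative hand-object distances (a simple ground-truth proxy).

    hand_verts: (H, 3) hand mesh vertices
    obj_verts:  (O, 3) object mesh vertices
    Returns, for every hand vertex, the distance to the closest object vertex,
    and vice versa (brute force, O(H*O) memory).
    """
    # Pairwise Euclidean distances between all hand and object vertices.
    diff = hand_verts[:, None, :] - obj_verts[None, :, :]   # (H, O, 3)
    dists = np.linalg.norm(diff, axis=-1)                   # (H, O)
    hand_to_obj = dists.min(axis=1)   # (H,) distance of each hand vertex to the object
    obj_to_hand = dists.min(axis=0)   # (O,) distance of each object vertex to the hand
    return hand_to_obj, obj_to_hand

# Toy usage with random vertices standing in for hand and object meshes
# (778 is the vertex count of a MANO hand mesh).
hand = np.random.rand(778, 3)
obj = np.random.rand(4000, 3)
h2o, o2h = interaction_field(hand, obj)
print(h2o.shape, o2h.shape)   # (778,) (4000,)
```

Vertices whose distance is near zero indicate contact, which is one way such a field encodes the detailed, dynamic contact information mentioned above.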
Related papers
- HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and
Objects from Video [70.11702620562889]
We introduce HOLD -- the first category-agnostic method that reconstructs an articulated hand and object jointly from a monocular interaction video.
We develop a compositional articulated implicit model that can disentangle the 3D hand and object from 2D images.
Our method does not rely on 3D hand-object annotations while outperforming fully-supervised baselines in both in-the-lab and challenging in-the-wild settings.
arXiv Detail & Related papers (2023-11-30T10:50:35Z) - SHOWMe: Benchmarking Object-agnostic Hand-Object 3D Reconstruction [13.417086460511696]
We introduce the SHOWMe dataset which consists of 96 videos, annotated with real and detailed hand-object 3D textured meshes.
We consider a rigid hand-object scenario, in which the pose of the hand with respect to the object remains constant during the whole video sequence.
This assumption allows us to register sub-millimetre-precise ground-truth 3D scans to the image sequences in SHOWMe.
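To illustrate why this rigid-grasp assumption is so useful (a hypothetical NumPy sketch with made-up poses, not SHOWMe's pipeline): a single hand-to-object transform, estimated once, can be composed with the per-frame object pose to place a registered ground-truth scan in every frame.

```python
import numpy as np

def to_homogeneous(R, t):
    """Pack a rotation matrix and translation into a 4x4 SE(3) transform."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Hand-to-object transform: constant over the whole sequence under the
# rigid-grasp assumption, so it only needs to be estimated once.
T_hand_in_obj = to_homogeneous(np.eye(3), np.array([0.05, 0.0, 0.02]))

# Per-frame object-to-camera poses (toy values: identity rotation, moving along x).
frames = [to_homogeneous(np.eye(3), np.array([x, 0.0, 0.5])) for x in (0.0, 0.1, 0.2)]

# The hand pose in camera coordinates follows by composition for every frame.
for T_obj_in_cam in frames:
    T_hand_in_cam = T_obj_in_cam @ T_hand_in_obj
    print(T_hand_in_cam[:3, 3])   # the hand position tracks the object exactly
```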
arXiv Detail & Related papers (2023-09-19T16:48:29Z) - CaSAR: Contact-aware Skeletal Action Recognition [47.249908147135855]
We present a new framework called Contact-aware Skeletal Action Recognition (CaSAR)
CaSAR uses novel representations of hand-object interaction that encompass spatial information.
Our framework is able to learn how the hands touch or stay away from the objects for each frame of the action sequence, and use this information to predict the action class.
arXiv Detail & Related papers (2023-09-17T09:42:40Z) - ROAM: Robust and Object-Aware Motion Generation Using Neural Pose
Descriptors [73.26004792375556]
This paper shows that robustness and generalisation to novel scene objects in 3D object-aware character synthesis can be achieved by training a motion model with as few as one reference object.
We leverage an implicit feature representation trained on object-only datasets, which encodes an SE(3)-equivariant descriptor field around the object.
We demonstrate substantial improvements in 3D virtual character motion and interaction quality and robustness to scenarios with unseen objects.
arXiv Detail & Related papers (2023-08-24T17:59:51Z) - Estimating 3D Motion and Forces of Human-Object Interactions from
Internet Videos [49.52070710518688]
We introduce a method to reconstruct the 3D motion of a person interacting with an object from a single RGB video.
Our method estimates the 3D poses of the person together with the object pose, the contact positions and the contact forces on the human body.
arXiv Detail & Related papers (2021-11-02T13:40:18Z) - D3D-HOI: Dynamic 3D Human-Object Interactions from Videos [49.38319295373466]
We introduce D3D-HOI: a dataset of monocular videos with ground truth annotations of 3D object pose, shape and part motion during human-object interactions.
Our dataset consists of several common articulated objects captured from diverse real-world scenes and camera viewpoints.
We leverage the estimated 3D human pose for more accurate inference of the object spatial layout and dynamics.
arXiv Detail & Related papers (2021-08-19T00:49:01Z) - HandsFormer: Keypoint Transformer for Monocular 3D Pose Estimation
of Hands and Object in Interaction [33.661745138578596]
We propose a robust and accurate method for estimating the 3D poses of two hands in close interaction from a single color image.
Our method starts by extracting a set of potential 2D locations for the joints of both hands as extrema of a heatmap.
We use appearance and spatial encodings of these locations as input to a transformer, and leverage the attention mechanisms to sort out the correct configuration of the joints.
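A minimal sketch of that first stage (using SciPy's maximum filter; illustrative only, not the authors' implementation): candidate 2D joint locations are read off as local maxima of a heatmap before being encoded and passed to the transformer.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def heatmap_extrema(heatmap, threshold=0.3, window=5):
    """Return (row, col) coordinates of local maxima above a score threshold."""
    # A pixel is a local maximum if it equals the maximum over its neighbourhood.
    local_max = maximum_filter(heatmap, size=window) == heatmap
    candidates = np.argwhere(local_max & (heatmap > threshold))
    return candidates  # (N, 2) candidate joint locations for later disambiguation

# Toy heatmap with two peaks standing in for a per-joint network output.
hm = np.zeros((64, 64))
hm[20, 30] = 0.9
hm[40, 10] = 0.7
print(heatmap_extrema(hm))   # [[20 30] [40 10]]
```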
arXiv Detail & Related papers (2021-04-29T20:19:20Z) - H2O: Two Hands Manipulating Objects for First Person Interaction
Recognition [70.46638409156772]
We present a comprehensive framework for egocentric interaction recognition using markerless 3D annotations of two hands manipulating objects.
Our method produces annotations of the 3D pose of two hands and the 6D pose of the manipulated objects, along with their interaction labels for each frame.
Our dataset, called H2O (2 Hands and Objects), provides synchronized multi-view RGB-D images, interaction labels, object classes, ground-truth 3D poses for left & right hands, 6D object poses, ground-truth camera poses, object meshes and scene point clouds.
arXiv Detail & Related papers (2021-04-22T17:10:42Z)