InterCap: Joint Markerless 3D Tracking of Humans and Objects in
Interaction
- URL: http://arxiv.org/abs/2209.12354v1
- Date: Mon, 26 Sep 2022 00:46:49 GMT
- Title: InterCap: Joint Markerless 3D Tracking of Humans and Objects in
Interaction
- Authors: Yinghao Huang (1), Omid Taheri (1), Michael J. Black (1), Dimitrios
Tzionas (2) ((1) Max Planck Institute for Intelligent Systems, Tübingen,
Germany, (2) University of Amsterdam, Amsterdam, The Netherlands)
- Abstract summary: InterCap reconstructs whole-bodies and objects from multi-view RGB-D data.
Azure Kinect sensors allow us to set up a simple multi-view RGB-D capture system.
InterCap has 223 RGB-D videos, resulting in 67,357 multi-view frames, each containing 6 RGB-D images.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans constantly interact with daily objects to accomplish tasks. To
understand such interactions, computers need to reconstruct these from cameras
observing whole-body interaction with scenes. This is challenging due to
occlusion between the body and objects, motion blur, depth/scale ambiguities,
and the low image resolution of hands and graspable object parts. To make the
problem tractable, the community focuses either on interacting hands, ignoring
the body, or on interacting bodies, ignoring hands. The GRAB dataset addresses
dexterous whole-body interaction but uses marker-based MoCap and lacks images,
while BEHAVE captures video of body-object interaction but lacks hand detail.
We address the limitations of prior work with InterCap, a novel method that
reconstructs interacting whole-bodies and objects from multi-view RGB-D data,
using the parametric whole-body model SMPL-X and known object meshes. To tackle
the above challenges, InterCap uses two key observations: (i) Contact between
the hand and object can be used to improve the pose estimation of both. (ii)
Azure Kinect sensors allow us to set up a simple multi-view RGB-D capture
system that minimizes the effect of occlusion while providing reasonable
inter-camera synchronization. With this method we capture the InterCap dataset,
which contains 10 subjects (5 males and 5 females) interacting with 10 objects
of various sizes and affordances, including contact with the hands or feet. In
total, InterCap has 223 RGB-D videos, resulting in 67,357 multi-view frames,
each containing 6 RGB-D images. Our method provides pseudo ground-truth body
meshes and objects for each video frame. Our InterCap method and dataset fill
an important gap in the literature and support many research directions. Our
data and code are available for research purposes.
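The abstract's first key observation, that hand-object contact can refine the pose estimates of both, is the kind of constraint usually expressed as an extra term in an optimization-based fitting objective. Below is a minimal, hypothetical sketch of such a contact term over generic vertex sets; it is not the authors' implementation, and the function name, threshold value, and toy data are all assumptions.

```python
# Hypothetical sketch of a contact term coupling hand and object pose
# estimates, in the spirit of InterCap's observation (i). Not the
# authors' implementation; names and the 2 cm threshold are assumptions.
import numpy as np

def contact_energy(hand_verts: np.ndarray,
                   obj_verts: np.ndarray,
                   thresh: float = 0.02) -> float:
    """Sum of squared hand-to-object distances for hand vertices that
    are already within `thresh` meters of the object; minimizing this
    pulls likely contact pairs together, constraining both poses."""
    # Pairwise distances between hand and object vertices: shape (H, O).
    diff = hand_verts[:, None, :] - obj_verts[None, :, :]
    dists = np.linalg.norm(diff, axis=-1)
    nearest = dists.min(axis=1)        # per hand vertex, distance to object
    in_contact = nearest < thresh      # candidate contact vertices
    return float(np.sum(nearest[in_contact] ** 2))

# Toy usage with random stand-ins for SMPL-X hand vertices and an object.
rng = np.random.default_rng(0)
hand_verts = rng.normal(size=(100, 3)) * 0.01   # ~1 cm point cloud
obj_verts = rng.normal(size=(200, 3)) * 0.01
print(f"contact energy: {contact_energy(hand_verts, obj_verts):.6f}")
```

In a full fitting pipeline such a term would be weighted and summed with data terms (e.g. 2D keypoint reprojection and depth residuals) and optimized jointly over SMPL-X and object pose parameters.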
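Observation (ii) concerns the capture rig: with six roughly synchronized Azure Kinects and known extrinsics, per-camera depth maps can be back-projected and fused into a single world-frame point cloud, which is what mitigates occlusion. A short sketch of that fusion step, under the assumed convention that extrinsics map camera coordinates to world coordinates:

```python
# Hypothetical sketch of fusing per-camera point clouds into one world
# frame, as enabled by a calibrated multi-view RGB-D rig like the six
# Azure Kinects described above. The world-from-camera extrinsics
# convention and toy data are assumptions.
import numpy as np

def fuse_point_clouds(clouds, extrinsics):
    """clouds: list of (N_i, 3) camera-frame point arrays;
    extrinsics: list of 4x4 world-from-camera transforms."""
    fused = []
    for pts, T in zip(clouds, extrinsics):
        homo = np.hstack([pts, np.ones((pts.shape[0], 1))])  # (N_i, 4)
        fused.append((homo @ T.T)[:, :3])  # to world frame, drop w
    return np.vstack(fused)

# Toy usage with two cameras: identity, and a 1 m translation along x.
rng = np.random.default_rng(1)
cam0 = rng.normal(size=(50, 3))
cam1 = rng.normal(size=(50, 3))
T0 = np.eye(4)
T1 = np.eye(4)
T1[0, 3] = 1.0
print(fuse_point_clouds([cam0, cam1], [T0, T1]).shape)  # (100, 3)
```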
Related papers
- Articulated 3D Scene Graphs for Open-World Mobile Manipulation [55.97942733699124]
We present MoMa-SG, a framework for building semantic-kinematic 3D scene graphs of articulated scenes.
We estimate articulation models using a novel unified twist estimation formulation.
We also introduce the novel Arti4D-Semantic dataset.
arXiv Detail & Related papers (2026-02-18T10:40:35Z) - InteractMove: Text-Controlled Human-Object Interaction Generation in 3D Scenes with Movable Objects [15.92165183796286]
We propose a novel task of text-controlled human-object interaction generation in 3D scenes with movable objects.
Existing human-scene interaction datasets suffer from insufficient interaction categories.
We propose hand-object joint affordance learning to predict contact regions for different hand joints.
arXiv Detail & Related papers (2025-09-28T03:29:15Z) - Generalizable Articulated Object Reconstruction from Casually Captured RGBD Videos [53.47352228180637]
We focus on reconstructing an articulated object from a casually captured RGBD video shot with a hand-held camera.
A casually captured video of an interaction with an articulated object is easy to acquire at scale using smartphones.
We introduce a coarse-to-fine framework that infers joint parameters and segments movable parts of the object from a dynamic RGBD video.
arXiv Detail & Related papers (2025-06-10T01:41:46Z) - HUMOTO: A 4D Dataset of Mocap Human Object Interactions [27.573065832588554]
Human Motions with Objects (HUMOTO) is a high-fidelity dataset of human-object interactions for motion generation, computer vision, and robotics applications.
HUMOTO captures interactions with 63 precisely modeled objects and 72 articulated parts.
Professional artists rigorously clean and verify each sequence, minimizing foot sliding and object penetrations.
arXiv Detail & Related papers (2025-04-14T16:59:29Z) - PickScan: Object discovery and reconstruction from handheld interactions [99.99566882133179]
We develop an interaction-guided and class-agnostic method to reconstruct 3D representations of scenes.
Our main contribution is a novel approach to detecting user-object interactions and extracting the masks of manipulated objects.
Compared to Co-Fusion, the only comparable interaction-based and class-agnostic baseline, our approach reduces chamfer distance by 73%.
arXiv Detail & Related papers (2024-11-17T23:09:08Z) - Dense Hand-Object(HO) GraspNet with Full Grasping Taxonomy and Dynamics [43.30868393851785]
HOGraspNet is a training dataset for 3D hand-object interaction.
The dataset includes diverse hand shapes from 99 participants aged 10 to 74.
It offers labels for 3D hand and object meshes, 3D keypoints, contact maps, and grasp labels.
arXiv Detail & Related papers (2024-09-06T05:49:38Z) - A New People-Object Interaction Dataset and NVS Benchmarks [16.909004722367644]
We introduce a new people-object interaction dataset that comprises 38 series of 30-view multi-person or single-person RGB-D video sequences.
Video sequences are captured by 30 Kinect Azures uniformly surrounding the scene, each at 4K resolution and 25 FPS, lasting 1 to 19 seconds.
arXiv Detail & Related papers (2024-09-03T08:54:15Z) - InterTracker: Discovering and Tracking General Objects Interacting with
Hands in the Wild [40.489171608114574]
Existing methods rely on frame-based detectors to locate interacting objects.
We propose to leverage hand-object interaction to track interactive objects.
Our proposed method outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2023-08-06T09:09:17Z) - Full-Body Articulated Human-Object Interaction [61.01135739641217]
CHAIRS is a large-scale motion-captured f-AHOI dataset consisting of 16.2 hours of versatile interactions.
CHAIRS provides 3D meshes of both humans and articulated objects during the entire interactive process.
By learning the geometrical relationships in HOI, we devise the very first model that leverages human pose estimation.
arXiv Detail & Related papers (2022-12-20T19:50:54Z) - BEHAVE: Dataset and Method for Tracking Human Object Interactions [105.77368488612704]
We present the first full-body human-object interaction dataset with multi-view RGBD frames and corresponding 3D SMPL and object fits, along with the annotated contacts between them.
We use this data to learn a model that can jointly track humans and objects in natural environments with an easy-to-use portable multi-camera setup.
arXiv Detail & Related papers (2022-04-14T13:21:19Z) - EgoBody: Human Body Shape, Motion and Social Interactions from
Head-Mounted Devices [76.50816193153098]
EgoBody is a novel large-scale dataset for social interactions in complex 3D scenes.
We employ Microsoft HoloLens2 headsets to record rich egocentric data streams including RGB, depth, eye gaze, head and hand tracking.
To obtain accurate 3D ground-truth, we calibrate the headset with a multi-Kinect rig and fit expressive SMPL-X body meshes to multi-view RGB-D frames.
arXiv Detail & Related papers (2021-12-14T18:41:28Z) - Estimating 3D Motion and Forces of Human-Object Interactions from
Internet Videos [49.52070710518688]
We introduce a method to reconstruct the 3D motion of a person interacting with an object from a single RGB video.
Our method estimates the 3D poses of the person together with the object pose, the contact positions and the contact forces on the human body.
arXiv Detail & Related papers (2021-11-02T13:40:18Z) - H2O: Two Hands Manipulating Objects for First Person Interaction
Recognition [70.46638409156772]
We present a comprehensive framework for egocentric interaction recognition using markerless 3D annotations of two hands manipulating objects.
Our method produces annotations of the 3D pose of two hands and the 6D pose of the manipulated objects, along with their interaction labels for each frame.
Our dataset, called H2O (2 Hands and Objects), provides synchronized multi-view RGB-D images, interaction labels, object classes, ground-truth 3D poses for left & right hands, 6D object poses, ground-truth camera poses, object meshes and scene point clouds.
arXiv Detail & Related papers (2021-04-22T17:10:42Z)