Hand-Object Contact Prediction via Motion-Based Pseudo-Labeling and Guided Progressive Label Correction
- URL: http://arxiv.org/abs/2110.10174v1
- Date: Tue, 19 Oct 2021 18:00:02 GMT
- Title: Hand-Object Contact Prediction via Motion-Based Pseudo-Labeling and Guided Progressive Label Correction
- Authors: Takuma Yagi, Md Tasnimul Hasan, Yoichi Sato
- Abstract summary: We introduce a video-based method for predicting contact between a hand and an object.
Annotating a large number of hand-object tracks and contact labels is costly.
We propose a semi-supervised framework consisting of (i) automatic collection of training data with motion-based pseudo-labels and (ii) guided progressive label correction (gPLC).
- Score: 27.87570749976023
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Every hand-object interaction begins with contact. Although predicting the
contact state between hands and objects is useful for understanding hand-object
interactions, prior methods on hand-object analysis have assumed that the
interacting hands and objects are known, and contact prediction itself has not been studied in detail. In
this study, we introduce a video-based method for predicting contact between a
hand and an object. Specifically, given a video and a pair of hand and object
tracks, we predict a binary contact state (contact or no-contact) for each
frame. However, annotating a large number of hand-object tracks and contact
labels is costly. To overcome the difficulty, we propose a semi-supervised
framework consisting of (i) automatic collection of training data with
motion-based pseudo-labels and (ii) guided progressive label correction (gPLC),
which corrects noisy pseudo-labels with a small amount of trusted data. We
validated our framework's effectiveness on a newly built benchmark dataset for
hand-object contact prediction and showed superior performance against existing
baseline methods. Code and data are available at
https://github.com/takumayagi/hand_object_contact_prediction.
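The motion-based pseudo-labeling idea lends itself to a small sketch: a held object tends to move with the hand, while an untouched object stays still as the hand moves. The function names, thresholds, and the simplified gPLC-style correction pass below are illustrative assumptions, not the paper's actual implementation.

    import numpy as np

    def centers(boxes):
        # boxes: (T, 4) array of per-frame [x1, y1, x2, y2] track boxes
        return np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                         (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)

    def motion_pseudo_labels(hand_boxes, obj_boxes, move_thr=2.0, sim_thr=0.8):
        # Per-frame pseudo-labels: 1 = contact, 0 = no-contact, -1 = unlabeled.
        # Thresholds (pixels/frame, cosine similarity) are hypothetical.
        hv = np.diff(centers(hand_boxes), axis=0)  # hand velocity, (T-1, 2)
        ov = np.diff(centers(obj_boxes), axis=0)   # object velocity, (T-1, 2)
        hand_speed = np.linalg.norm(hv, axis=1)
        obj_speed = np.linalg.norm(ov, axis=1)
        cos = (hv * ov).sum(axis=1) / (hand_speed * obj_speed + 1e-6)

        labels = np.full(len(hv), -1, dtype=np.int64)
        labels[(obj_speed > move_thr) & (cos > sim_thr)] = 1                # moves with hand
        labels[(hand_speed > move_thr) & (obj_speed < 0.5 * move_thr)] = 0  # hand moves alone
        return labels

    def gplc_step(pseudo, probs, flip_thr=0.9):
        # One correction pass in the spirit of gPLC: flip a noisy pseudo-label
        # when the current model (trained with a small trusted set) confidently
        # disagrees. probs: per-frame contact probability from the model.
        pred = (probs > 0.5).astype(np.int64)
        confident = np.maximum(probs, 1.0 - probs) > flip_thr
        out = pseudo.copy()
        flip = confident & (pred != pseudo) & (pseudo != -1)
        out[flip] = pred[flip]
        return out

Frames where neither motion rule fires stay unlabeled, which is exactly where the small trusted set and the iterative correction loop matter most.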
Related papers
- ClickDiff: Click to Induce Semantic Contact Map for Controllable Grasp Generation with Diffusion Models [17.438429495623755]
ClickDiff is a controllable conditional generation model that leverages a fine-grained Semantic Contact Map.
Within this framework, the Semantic Conditional Module generates reasonable contact maps based on fine-grained contact information.
Experiments demonstrate the efficacy and robustness of ClickDiff, even with previously unseen objects.
arXiv Detail & Related papers (2024-07-28T02:42:29Z)
- Novel-view Synthesis and Pose Estimation for Hand-Object Interaction from Sparse Views [41.50710846018882]
We propose a neural rendering and pose estimation system for hand-object interaction from sparse views.
We first learn the shape and appearance prior knowledge of hands and objects separately with the neural representation.
During the online stage, we design a rendering-based joint model fitting framework to understand the dynamic hand-object interaction.
arXiv Detail & Related papers (2023-08-22T05:17:41Z)
- Learning Explicit Contact for Implicit Reconstruction of Hand-held Objects from Monocular Images [59.49985837246644]
We show how to model contacts in an explicit way to benefit the implicit reconstruction of hand-held objects.
In the first part, we propose a new subtask of directly estimating 3D hand-object contacts from a single image.
In the second part, we introduce a novel method to diffuse estimated contact states from the hand mesh surface to nearby 3D space.
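As a rough illustration of that second part, contact estimated on the hand mesh surface can be spread to nearby 3D points with a distance kernel. The Gaussian weighting, bandwidth, and names below are assumptions, not the paper's actual diffusion scheme.

    import numpy as np

    def diffuse_contact(hand_verts, vert_contact, query_pts, sigma=0.01):
        # hand_verts:   (V, 3) hand mesh vertices (meters)
        # vert_contact: (V,)   contact probability per vertex
        # query_pts:    (Q, 3) 3D points at which to evaluate contact
        d = np.linalg.norm(query_pts[:, None, :] - hand_verts[None, :, :], axis=-1)
        w = np.exp(-d ** 2 / (2 * sigma ** 2))        # Gaussian kernel, (Q, V)
        return (w * vert_contact[None, :]).sum(axis=1) / (w.sum(axis=1) + 1e-9)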
arXiv Detail & Related papers (2023-05-31T17:59:26Z)
- ContactArt: Learning 3D Interaction Priors for Category-level Articulated Object and Hand Poses Estimation [46.815231896011284]
We propose a new dataset and a novel approach to learning hand-object interaction priors for hand and articulated object pose estimation.
We first collect a dataset using visual teleoperation, where the human operator can directly play within a physical simulator to manipulate the articulated objects.
Our system only requires an iPhone to record human hand motion, which can be easily scaled up and greatly lowers the cost of data and annotation collection.
arXiv Detail & Related papers (2023-05-02T17:24:08Z)
- S$^2$Contact: Graph-based Network for 3D Hand-Object Contact Estimation with Semi-Supervised Learning [70.72037296392642]
We propose a novel semi-supervised framework that allows us to learn contact from monocular images.
Specifically, we leverage visual and geometric consistency constraints in large-scale datasets for generating pseudo-labels.
We show the benefit of using a contact map that constrains hand-object interactions to produce more accurate reconstructions.
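A minimal sketch of the geometric side of such filtering, assuming per-vertex contact predictions and an object point cloud: a predicted contact map is kept as a pseudo-label only if the vertices it marks as contacting actually lie near the object surface. The thresholds are made up, and the paper's visual consistency checks are omitted.

    import numpy as np

    def keep_as_pseudo_label(contact_prob, hand_verts, obj_pts,
                             dist_thr=0.005, ratio_thr=0.8):
        # contact_prob: (V,) predicted contact probability per hand vertex
        # hand_verts: (V, 3) hand vertices; obj_pts: (N, 3) object points (meters)
        in_contact = contact_prob > 0.5
        if not in_contact.any():
            return None  # nothing to verify; skip this sample
        # Nearest-object distance for every vertex predicted to be in contact
        d = np.linalg.norm(hand_verts[in_contact][:, None, :] - obj_pts[None, :, :],
                           axis=-1).min(axis=1)
        ratio = (d < dist_thr).mean()  # fraction of geometrically consistent contacts
        return in_contact.astype(np.float32) if ratio >= ratio_thr else None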
arXiv Detail & Related papers (2022-08-01T14:05:23Z)
- TOCH: Spatio-Temporal Object-to-Hand Correspondence for Motion Refinement [42.3418874174372]
We present TOCH, a method for refining incorrect 3D hand-object interaction sequences using a data prior.
We learn a latent manifold of plausible TOCH fields with a temporal denoising auto-encoder.
Experiments demonstrate that TOCH outperforms state-of-the-art 3D hand-object interaction models.
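A toy version of that idea, with made-up dimensions standing in for per-frame TOCH fields: train an auto-encoder to map corrupted sequences back to clean ones, and the learned latent space acts as a manifold of plausible interactions onto which noisy sequences can be projected.

    import torch
    import torch.nn as nn

    class TemporalDenoiser(nn.Module):
        # Sequence-to-sequence denoising auto-encoder (illustrative, not TOCH's
        # actual architecture). Input/output: (B, T, feat_dim) feature sequences.
        def __init__(self, feat_dim=64, hidden=128, latent=32):
            super().__init__()
            self.enc = nn.GRU(feat_dim, hidden, batch_first=True)
            self.to_latent = nn.Linear(hidden, latent)
            self.from_latent = nn.Linear(latent, hidden)
            self.dec = nn.GRU(hidden, hidden, batch_first=True)
            self.out = nn.Linear(hidden, feat_dim)

        def forward(self, x):
            h, _ = self.enc(x)
            z = self.to_latent(h)                 # per-frame latent trajectory
            h2, _ = self.dec(self.from_latent(z))
            return self.out(h2)                   # reconstructed clean sequence

    # Training sketch: corrupt clean sequences, regress the clean ones back.
    # model = TemporalDenoiser()
    # noisy = clean + 0.1 * torch.randn_like(clean)
    # loss = nn.functional.mse_loss(model(noisy), clean)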
arXiv Detail & Related papers (2022-05-16T20:41:45Z)
- ARCTIC: A Dataset for Dexterous Bimanual Hand-Object Manipulation [68.80339307258835]
ARCTIC is a dataset of two hands that dexterously manipulate objects.
It contains 2.1M video frames paired with accurate 3D hand meshes and detailed, dynamic contact information.
arXiv Detail & Related papers (2022-04-28T17:23:59Z)
- H2O: Two Hands Manipulating Objects for First Person Interaction Recognition [70.46638409156772]
We present a comprehensive framework for egocentric interaction recognition using markerless 3D annotations of two hands manipulating objects.
Our method produces annotations of the 3D pose of two hands and the 6D pose of the manipulated objects, along with their interaction labels for each frame.
Our dataset, called H2O (2 Hands and Objects), provides synchronized multi-view RGB-D images, interaction labels, object classes, ground-truth 3D poses for left & right hands, 6D object poses, ground-truth camera poses, object meshes and scene point clouds.
arXiv Detail & Related papers (2021-04-22T17:10:42Z)
- "What's This?" -- Learning to Segment Unknown Objects from Manipulation Sequences [27.915309216800125]
We present a novel framework for self-supervised grasped object segmentation with a robotic manipulator.
We propose a single, end-to-end trainable architecture which jointly incorporates motion cues and semantic knowledge.
Our method neither depends on any visual registration of a kinematic robot or 3D object models, nor on precise hand-eye calibration or any additional sensor data.
arXiv Detail & Related papers (2020-11-06T10:55:28Z)
- ConsNet: Learning Consistency Graph for Zero-Shot Human-Object Interaction Detection [101.56529337489417]
We consider the problem of Human-Object Interaction (HOI) Detection, which aims to locate and recognize HOI instances in the form of <human, action, object> in images.
We argue that multi-level consistencies among objects, actions and interactions are strong cues for generating semantic representations of rare or previously unseen HOIs.
Our model takes visual features of candidate human-object pairs and word embeddings of HOI labels as inputs, maps them into visual-semantic joint embedding space and obtains detection results by measuring their similarities.
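A minimal sketch of that joint-embedding scoring step (ConsNet's consistency graph is omitted): visual pair features and HOI label word embeddings are projected into a shared space and compared by cosine similarity, which is what lets unseen HOI labels, which still have word embeddings, be scored zero-shot. Dimensions and names are assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class JointEmbeddingScorer(nn.Module):
        def __init__(self, vis_dim=1024, txt_dim=300, joint_dim=256):
            super().__init__()
            self.vis_proj = nn.Linear(vis_dim, joint_dim)  # visual branch
            self.txt_proj = nn.Linear(txt_dim, joint_dim)  # semantic branch

        def forward(self, pair_feat, label_embs):
            # pair_feat: (B, vis_dim) features of candidate human-object pairs
            # label_embs: (L, txt_dim) word embeddings of all HOI labels
            v = F.normalize(self.vis_proj(pair_feat), dim=-1)
            t = F.normalize(self.txt_proj(label_embs), dim=-1)
            return v @ t.T  # (B, L) cosine-similarity detection scores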
arXiv Detail & Related papers (2020-08-14T09:11:18Z)