Detailed 2D-3D Joint Representation for Human-Object Interaction
- URL: http://arxiv.org/abs/2004.08154v2
- Date: Thu, 21 May 2020 04:51:52 GMT
- Title: Detailed 2D-3D Joint Representation for Human-Object Interaction
- Authors: Yong-Lu Li, Xinpeng Liu, Han Lu, Shiyi Wang, Junqi Liu, Jiefeng Li,
Cewu Lu
- Abstract summary: We propose a detailed 2D-3D joint representation learning method for HOI learning.
First, we utilize a single-view human body capture method to obtain detailed 3D body, face and hand shapes.
Next, we estimate the 3D object location and size with reference to the 2D human-object spatial configuration and object category priors.
- Score: 45.71407935014447
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human-Object Interaction (HOI) detection lies at the core of action
understanding. Besides 2D information such as human/object appearance and
locations, 3D pose is also commonly utilized in HOI learning because of its
view-independence. However, coarse 3D body joints carry only sparse body
information and are not sufficient to understand complex interactions; thus,
detailed 3D body shape is needed to go further. Meanwhile, the 3D
representation of the interacted object has not been fully studied in HOI
learning. In light of these, we propose a detailed 2D-3D joint representation
learning method. First, we utilize a single-view human body capture method to
obtain detailed 3D body, face and hand shapes. Next, we estimate the 3D object
location and size with reference to the 2D human-object spatial configuration
and object category priors. Finally, a joint learning framework and
cross-modal consistency tasks are proposed to learn the joint HOI
representation. To better evaluate the 2D ambiguity processing capacity of
models, we propose a new benchmark named Ambiguous-HOI consisting of hard
ambiguous images. Extensive experiments on a large-scale HOI benchmark and on
Ambiguous-HOI show the effectiveness of our method. Code and data are
available at https://github.com/DirtyHarryLYL/DJ-RN.
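The 3D object initialization step described in the abstract (estimating the object's 3D location and size from the 2D human-object spatial configuration and an object category prior) can be illustrated with a short sketch. The Python snippet below assumes a pinhole camera with a known focal length and a hypothetical per-category physical size prior; the prior values, names, and heuristic are illustrative and are not the authors' exact formulation (see the paper and repository for the actual procedure).

import numpy as np

# Hypothetical per-category physical size priors (longest object extent, meters).
# These values are illustrative only, not taken from the DJ-RN paper.
CATEGORY_SIZE_PRIOR = {
    "cup": 0.10,
    "laptop": 0.35,
    "bicycle": 1.70,
}

def estimate_object_3d(box_xyxy, category, focal_length, principal_point):
    """Lift a 2D object box to a rough 3D location and size.

    Uses the pinhole relation  pixel_size ~= focal_length * real_size / depth,
    so  depth ~= focal_length * real_size / pixel_size.
    """
    x1, y1, x2, y2 = box_xyxy
    pixel_size = max(x2 - x1, y2 - y1)          # longest box side in pixels
    real_size = CATEGORY_SIZE_PRIOR[category]   # prior physical size in meters

    depth = focal_length * real_size / pixel_size

    # Back-project the box center (u, v) at the estimated depth.
    u = 0.5 * (x1 + x2)
    v = 0.5 * (y1 + y2)
    cx, cy = principal_point
    x = (u - cx) * depth / focal_length
    y = (v - cy) * depth / focal_length
    return np.array([x, y, depth]), real_size

if __name__ == "__main__":
    center, size = estimate_object_3d(
        box_xyxy=(300.0, 220.0, 380.0, 300.0),
        category="cup",
        focal_length=1000.0,            # in pixels
        principal_point=(320.0, 240.0),
    )
    print("3D center (m):", center, "size prior (m):", size)

In the paper's setting, a 3D location and size obtained this way would be placed relative to the reconstructed 3D human so that the 2D-3D joint representation can be learned over the full human-object configuration.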
Related papers
- CHORUS: Learning Canonicalized 3D Human-Object Spatial Relations from
Unbounded Synthesized Images [10.4286198282079]
We present a method for teaching machines to understand and model the underlying spatial common sense of diverse human-object interactions in 3D.
The method learns from multiple 2D images, captured from different viewpoints, of humans interacting with the same type of object.
Despite their imperfect image quality compared to real images, we demonstrate that the synthesized images are sufficient to learn 3D human-object spatial relations.
arXiv Detail & Related papers (2023-08-23T17:59:11Z) - Tracking Objects with 3D Representation from Videos [57.641129788552675]
We propose a new 2D Multiple Object Tracking (MOT) paradigm, called P3DTrack, which learns 3D object representations from pseudo 3D object labels in monocular videos.
arXiv Detail & Related papers (2023-06-08T17:58:45Z) - TANDEM3D: Active Tactile Exploration for 3D Object Recognition [16.548376556543015]
We propose TANDEM3D, a method that applies a co-training framework for 3D object recognition with tactile signals.
TANDEM3D is based on a novel encoder that builds 3D object representation from contact positions and normals using PointNet++.
Our method is trained entirely in simulation and validated with real-world experiments.
arXiv Detail & Related papers (2022-09-19T05:54:26Z) - Gait Recognition in the Wild with Dense 3D Representations and A
Benchmark [86.68648536257588]
Existing studies of gait recognition are dominated by 2D representations, such as the silhouette or skeleton of the human body, in constrained scenes.
This paper aims to explore dense 3D representations for gait recognition in the wild.
We build the first large-scale 3D representation-based gait recognition dataset, named Gait3D.
arXiv Detail & Related papers (2022-04-06T03:54:06Z) - GRAB: A Dataset of Whole-Body Human Grasping of Objects [53.00728704389501]
Training computers to understand human grasping requires a rich dataset containing complex 3D object shapes, detailed contact information, hand pose and shape, and the 3D body motion over time.
We collect a new dataset, called GRAB, of whole-body grasps, containing full 3D shape and pose sequences of 10 subjects interacting with 51 everyday objects of varying shape and size.
This is a unique dataset that goes well beyond existing ones for modeling and understanding how humans grasp and manipulate objects, how their full body is involved, and how interaction varies with the task.
arXiv Detail & Related papers (2020-08-25T17:57:55Z) - Interactive Annotation of 3D Object Geometry using 2D Scribbles [84.51514043814066]
In this paper, we propose an interactive framework for annotating 3D object geometry from point cloud data and RGB imagery.
Our framework targets naive users without artistic or graphics expertise.
arXiv Detail & Related papers (2020-08-24T21:51:29Z) - Parameter-Efficient Person Re-identification in the 3D Space [51.092669618679615]
We project 2D images to a 3D space and introduce a novel parameter-efficient Omni-scale Graph Network (OG-Net) to learn the pedestrian representation directly from 3D point clouds.
OG-Net effectively exploits the local information provided by sparse 3D points and takes advantage of the structure and appearance information in a coherent manner.
This is among the first attempts to conduct person re-identification in 3D space.
arXiv Detail & Related papers (2020-06-08T13:20:33Z)