Grounding 3D Object Affordance from 2D Interactions in Images
- URL: http://arxiv.org/abs/2303.10437v2
- Date: Wed, 9 Aug 2023 07:11:11 GMT
- Title: Grounding 3D Object Affordance from 2D Interactions in Images
- Authors: Yuhang Yang, Wei Zhai, Hongchen Luo, Yang Cao, Jiebo Luo, Zheng-Jun Zha
- Abstract summary: Grounding 3D object affordance seeks to locate objects' "action possibilities" regions in 3D space.
Humans possess the ability to perceive object affordances in the physical world through demonstration images or videos.
We devise an Interaction-driven 3D Affordance Grounding Network (IAG), which aligns the region features of objects from different sources.
- Score: 128.6316708679246
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Grounding 3D object affordance seeks to locate objects' "action
possibilities" regions in 3D space, which serves as a link between
perception and operation for embodied agents. Existing studies primarily focus
on connecting visual affordances with geometric structure, e.g., relying on
annotations to declare interactive regions of interest on the object and
establishing a mapping between those regions and affordances. However, the
essence of learning object affordance is understanding how to use it, and
approaches that detach affordance from interaction are limited in
generalization. Humans, by contrast, can perceive object affordances in the
physical world from demonstration images or videos. Motivated by this, we
introduce a novel
task setting: grounding 3D object affordance from 2D interactions in images,
which faces the challenge of anticipating affordance through interactions of
different sources. To address this problem, we devise a novel
Interaction-driven 3D Affordance Grounding Network (IAG), which aligns the
region features of objects from different sources and models the interactive
contexts for 3D object affordance grounding. In addition, we collect a Point-Image
Affordance Dataset (PIAD) to support the proposed task. Comprehensive
experiments on PIAD demonstrate the reliability of the proposed task and the
superiority of our method. The project is available at
https://github.com/yyvhang/IAGNet.
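To make the proposed task format concrete, here is a minimal sketch of the core idea the abstract describes: fusing region features extracted from a 2D interaction image with per-point features of a 3D object, then predicting a per-point affordance score. This is not the IAG implementation released in the repository above; the module, feature dimensions, cross-attention fusion, and toy inputs below are illustrative assumptions only.

import torch
import torch.nn as nn

class CrossSourceAffordanceHead(nn.Module):
    """Illustrative fusion head: point features attend to image region features."""
    def __init__(self, dim: int = 256):
        super().__init__()
        # Cross-attention: per-point features (queries) attend to 2D region features (keys/values).
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Per-point affordance score in [0, 1].
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, point_feat, img_feat):
        # point_feat: (B, N_points, dim) from any point-cloud backbone
        # img_feat:   (B, N_regions, dim) from any 2D backbone over the interaction image
        fused, _ = self.attn(query=point_feat, key=img_feat, value=img_feat)
        fused = self.norm(point_feat + fused)    # residual fusion of the two sources
        return self.head(fused).squeeze(-1)      # (B, N_points) affordance heatmap

# Toy usage with random tensors standing in for backbone outputs.
model = CrossSourceAffordanceHead(dim=256)
points = torch.randn(2, 2048, 256)   # 2048 points per object
regions = torch.randn(2, 49, 256)    # e.g. a 7x7 grid of image regions
print(model(points, regions).shape)  # torch.Size([2, 2048])

In the actual method, the two feature sources would come from the paired image and point cloud in PIAD, and the fusion and affordance-head design follow the paper rather than this sketch.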
Related papers
- Grounding 3D Scene Affordance From Egocentric Interactions [52.5827242925951]
Grounding 3D scene affordance aims to locate interactive regions in 3D environments.
We introduce a novel task: grounding 3D scene affordance from egocentric interactions.
arXiv Detail & Related papers (2024-09-29T10:46:19Z)
- Learning 2D Invariant Affordance Knowledge for 3D Affordance Grounding [46.05283810364663]
We introduce the Multi-Image Guided Invariant-Feature-Aware 3D Affordance Grounding framework.
It grounds 3D object affordance regions by identifying common interaction patterns across multiple human-object interaction images.
arXiv Detail & Related papers (2024-08-23T12:27:33Z)
- AffordanceLLM: Grounding Affordance from Vision Language Models [36.97072698640563]
Affordance grounding refers to the task of finding the area of an object with which one can interact.
Much of the required knowledge is hidden, lying beyond the image content and the supervised labels of a limited training set.
We attempt to improve the generalization capability of current affordance grounding by taking advantage of the rich world, abstract, and human-object-interaction knowledge in vision-language models.
arXiv Detail & Related papers (2024-01-12T03:21:02Z)
- Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding [56.00186960144545]
3D visual grounding is the task of localizing the object in a 3D scene that is referred to by a natural-language description.
We propose a dense 3D grounding network, featuring four novel stand-alone modules that aim to improve grounding performance.
arXiv Detail & Related papers (2023-09-08T19:27:01Z)
- 3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding [58.924180772480504]
3D visual grounding aims to localize the target object in a 3D point cloud by a free-form language description.
We propose a relation-aware one-stage framework, named 3D Relative Position-aware Network (3DRP-Net).
arXiv Detail & Related papers (2023-07-25T09:33:25Z)
- Generating Visual Spatial Description via Holistic 3D Scene Understanding [88.99773815159345]
Visual spatial description (VSD) aims to generate texts that describe the spatial relations of the given objects within images.
With an external 3D scene extractor, we obtain the 3D objects and scene features for input images.
We construct a target object-centered 3D spatial scene graph (Go3D-S2G), so that we can model the spatial semantics of target objects within holistic 3D scenes.
arXiv Detail & Related papers (2023-05-19T15:53:56Z)
- Language Conditioned Spatial Relation Reasoning for 3D Object Grounding [87.03299519917019]
Localizing objects in 3D scenes based on natural language requires understanding and reasoning about spatial relations.
We propose a language-conditioned transformer model for grounding 3D objects and their spatial relations.
arXiv Detail & Related papers (2022-11-17T16:42:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.