Grounding 3D Object Affordance with Language Instructions, Visual Observations and Interactions
- URL: http://arxiv.org/abs/2504.04744v1
- Date: Mon, 07 Apr 2025 05:38:23 GMT
- Title: Grounding 3D Object Affordance with Language Instructions, Visual Observations and Interactions
- Authors: He Zhu, Quyu Kong, Kechun Xu, Xunlong Xia, Bing Deng, Jieping Ye, Rong Xiong, Yue Wang
- Abstract summary: We introduce a novel task that grounds 3D object affordance based on language instructions, visual observations and interactions. We propose LMAffordance3D, the first multi-modal, language-guided 3D affordance grounding network.
- Score: 43.784500616459304
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Grounding 3D object affordance is a task that locates objects in 3D space where they can be manipulated, which links perception and action for embodied intelligence. For example, an intelligent robot must accurately ground the affordance of an object and grasp it according to human instructions. In this paper, we introduce a novel task that grounds 3D object affordance based on language instructions, visual observations and interactions, which is inspired by cognitive science. We collect an Affordance Grounding dataset with Points, Images and Language instructions (AGPIL) to support the proposed task. In the 3D physical world, due to observation orientation, object rotation, or spatial occlusion, we can often obtain only a partial observation of an object, so the dataset includes affordance estimations of objects from full-view, partial-view, and rotation-view perspectives. To accomplish this task, we propose LMAffordance3D, the first multi-modal, language-guided 3D affordance grounding network, which applies a vision-language model to fuse 2D and 3D spatial features with semantic features. Comprehensive experiments on AGPIL demonstrate the effectiveness and superiority of our method on this task, even in unseen experimental settings. Our project is available at https://sites.google.com/view/lmaffordance3d.
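The abstract describes LMAffordance3D as fusing 2D and 3D spatial features with semantic (language) features before predicting affordance regions on the point cloud. The paper's actual architecture is not reproduced here; the snippet below is only a minimal, hypothetical PyTorch sketch of such a fusion head, with all module names, dimensions, and the concatenation-based fusion scheme being assumptions for illustration.

```python
# Minimal sketch (not the authors' code): a hypothetical head that fuses
# per-point 3D features, pooled 2D image features, and a language embedding
# to predict per-point affordance scores. Dimensions are illustrative.
import torch
import torch.nn as nn

class AffordanceFusionHead(nn.Module):
    def __init__(self, point_dim=128, image_dim=256, text_dim=512, hidden_dim=256):
        super().__init__()
        # Project the pooled 2D and language features into a shared semantic context.
        self.context_proj = nn.Sequential(
            nn.Linear(image_dim + text_dim, hidden_dim),
            nn.ReLU(),
        )
        # Per-point decoder: each point feature is concatenated with the broadcast context.
        self.decoder = nn.Sequential(
            nn.Linear(point_dim + hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # one affordance logit per point
        )

    def forward(self, point_feats, image_feat, text_feat):
        # point_feats: (B, N, point_dim); image_feat: (B, image_dim); text_feat: (B, text_dim)
        context = self.context_proj(torch.cat([image_feat, text_feat], dim=-1))  # (B, hidden_dim)
        context = context.unsqueeze(1).expand(-1, point_feats.size(1), -1)       # (B, N, hidden_dim)
        logits = self.decoder(torch.cat([point_feats, context], dim=-1))         # (B, N, 1)
        return torch.sigmoid(logits).squeeze(-1)                                 # per-point scores in [0, 1]

# Usage with random tensors standing in for encoder outputs.
head = AffordanceFusionHead()
scores = head(torch.randn(2, 2048, 128), torch.randn(2, 256), torch.randn(2, 512))
print(scores.shape)  # torch.Size([2, 2048])
```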
Related papers
- GREAT: Geometry-Intention Collaborative Inference for Open-Vocabulary 3D Object Affordance Grounding [53.42728468191711]
Open-Vocabulary 3D object affordance grounding aims to anticipate "action possibilities" regions on 3D objects with arbitrary instructions. We propose GREAT (GeometRy-intEntion collAboraTive inference) for Open-Vocabulary 3D Object Affordance Grounding.
arXiv Detail & Related papers (2024-11-29T11:23:15Z) - Functionality understanding and segmentation in 3D scenes [6.1744362771344]
We introduce Fun3DU, the first approach designed for functionality understanding in 3D scenes.
Fun3DU uses a language model to parse the task description through Chain-of-Thought reasoning.
We evaluate Fun3DU on SceneFun3D, the most recent and only dataset to benchmark this task.
arXiv Detail & Related papers (2024-11-25T11:57:48Z) - SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z) - AffordanceLLM: Grounding Affordance from Vision Language Models [36.97072698640563]
Affordance grounding refers to the task of finding the area of an object with which one can interact.
Much of this knowledge is hidden and lies beyond the image content and the supervised labels from a limited training set.
We attempt to improve the generalization capability of current affordance grounding by taking advantage of rich world, abstract, and human-object-interaction knowledge.
arXiv Detail & Related papers (2024-01-12T03:21:02Z) - 3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding [58.924180772480504]
3D visual grounding aims to localize the target object in a 3D point cloud by a free-form language description.
We propose a relation-aware one-stage framework, named 3D Relative Position-aware Network (3DRP-Net).
arXiv Detail & Related papers (2023-07-25T09:33:25Z) - Grounding 3D Object Affordance from 2D Interactions in Images [128.6316708679246]
Grounding 3D object affordance seeks to locate objects' "action possibilities" regions in 3D space.
Humans possess the ability to perceive object affordances in the physical world through demonstration images or videos.
We devise an Interaction-driven 3D Affordance Grounding Network (IAG), which aligns the region feature of objects from different sources.
arXiv Detail & Related papers (2023-03-18T15:37:35Z) - Semantic Abstraction: Open-World 3D Scene Understanding from 2D Vision-Language Models [17.606199768716532]
We study open-world 3D scene understanding, a family of tasks that require agents to reason about their 3D environment with an open-set vocabulary and out-of-domain visual inputs.
We propose Semantic Abstraction (SemAbs), a framework that equips 2D Vision-Language Models with new 3D spatial capabilities.
We demonstrate the usefulness of SemAbs on two open-world 3D scene understanding tasks.
arXiv Detail & Related papers (2022-07-23T13:10:25Z) - LanguageRefer: Spatial-Language Model for 3D Visual Grounding [72.7618059299306]
We develop a spatial-language model for a 3D visual grounding problem.
We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
arXiv Detail & Related papers (2021-07-07T18:55:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.