B2N3D: Progressive Learning from Binary to N-ary Relationships for 3D Object Grounding
- URL: http://arxiv.org/abs/2510.10194v1
- Date: Sat, 11 Oct 2025 12:17:12 GMT
- Title: B2N3D: Progressive Learning from Binary to N-ary Relationships for 3D Object Grounding
- Authors: Feng Xiao, Hongbin Xu, Hai Ci, Wenxiong Kang
- Abstract summary: We propose a novel progressive relational learning framework for 3D object grounding. We extend relational learning from binary to n-ary to identify visual relations that match the referential description globally. In the scene graph created with n-ary relationships, we use a multi-modal network with hybrid attention mechanisms to further localize the target.
- Score: 32.814585321361974
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Localizing 3D objects using natural language is essential for robotic scene understanding. The descriptions often involve multiple spatial relationships to distinguish similar objects, making 3D-language alignment difficult. Current methods model relationships only between object pairs, ignoring the global perceptual significance of n-ary combinations in multi-modal relational understanding. To address this, we propose a novel progressive relational learning framework for 3D object grounding. We extend relational learning from binary to n-ary to identify visual relations that match the referential description globally. Given the absence of specific annotations for referred objects in the training data, we design a grouped supervision loss to facilitate n-ary relational learning. In the scene graph created with n-ary relationships, we use a multi-modal network with hybrid attention mechanisms to further localize the target within the n-ary combinations. Experiments and ablation studies on the ReferIt3D and ScanRefer benchmarks demonstrate that our method outperforms the state of the art and confirm the advantages of n-ary relational perception in 3D localization.
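Two components of the abstract lend themselves to a concrete illustration: scoring n-ary object combinations against the referential description, and a grouped supervision loss that compensates for the missing combination-level annotations. The PyTorch snippet below is a minimal, assumption-based sketch of those ideas; the module names, the pooling of pairwise relation features into a combination score, and the softmax-based group loss are illustrative choices, not the authors' released implementation.

```python
# Minimal sketch (assumption-based) of n-ary relational scoring with a
# grouped supervision loss; module names and design details are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NaryRelationScorer(nn.Module):
    """Scores n-ary object combinations against a sentence embedding by
    pooling pairwise (binary) relation features of the combination members."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.pair_mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.score_head = nn.Linear(dim, 1)

    def forward(self, obj_feats, combos, lang_feat):
        # obj_feats: (num_objects, dim); lang_feat: (dim,)
        # combos: list of index tuples, each an n-ary combination of objects
        scores = []
        for combo in combos:
            members = obj_feats[list(combo)]                       # (n, dim)
            n = members.size(0)
            # Binary relation features for every ordered pair in the combination
            pairs = torch.cat(
                [members.unsqueeze(1).expand(n, n, -1),
                 members.unsqueeze(0).expand(n, n, -1)], dim=-1)   # (n, n, 2*dim)
            rel = self.pair_mlp(pairs.reshape(n * n, -1)).mean(0)  # pooled (dim,)
            # Language-conditioned score for the whole combination
            scores.append(self.score_head(rel * lang_feat))
        return torch.cat(scores)                                    # (num_combos,)


def grouped_supervision_loss(scores, combos, target_idx):
    """Group-level supervision: combinations containing the referred object
    form the positive group, since no combination-level labels exist."""
    labels = torch.tensor(
        [float(target_idx in c) for c in combos], device=scores.device)
    probs = F.softmax(scores, dim=0)
    # Maximize the probability mass assigned to the positive group
    return -torch.log((probs * labels).sum().clamp_min(1e-8))


# Example usage with random features (3 objects, all 2- and 3-ary combos):
if __name__ == "__main__":
    obj_feats, lang_feat = torch.randn(3, 256), torch.randn(256)
    combos = [(0, 1), (0, 2), (1, 2), (0, 1, 2)]
    scorer = NaryRelationScorer()
    loss = grouped_supervision_loss(scorer(obj_feats, combos, lang_feat),
                                    combos, target_idx=1)
    print(loss.item())
```

The group-level formulation simply treats every combination that contains the ground-truth target as a positive, which is one plausible way to supervise n-ary combinations when only the target object itself is annotated.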
Related papers
- Descrip3D: Enhancing Large Language Model-based 3D Scene Understanding with Object-Level Text Descriptions [28.185661905201222]
Descrip3D is a novel framework that explicitly encodes the relationships between objects using natural language. It allows for unified reasoning across various tasks such as grounding, captioning, and question answering.
arXiv Detail & Related papers (2025-07-19T09:19:16Z) - LSVG: Language-Guided Scene Graphs with 2D-Assisted Multi-Modal Encoding for 3D Visual Grounding [15.944945244005952]
3D visual grounding aims to localize the unique target described by natural language in 3D scenes. We propose a novel 3D visual grounding framework that constructs language-guided scene graphs with referred object discrimination.
arXiv Detail & Related papers (2025-05-07T02:02:15Z) - IAAO: Interactive Affordance Learning for Articulated Objects in 3D Environments [56.85804719947]
We present IAAO, a framework that builds an explicit 3D model for intelligent agents to gain understanding of articulated objects in their environment through interaction. We first build hierarchical features and label fields for each object state using 3D Gaussian Splatting (3DGS) by distilling mask features and view-consistent labels from multi-view images. We then perform object- and part-level queries on the 3D Gaussian primitives to identify static and articulated elements, estimating global transformations and local articulation parameters along with affordances.
arXiv Detail & Related papers (2025-04-09T12:36:48Z) - AugRefer: Advancing 3D Visual Grounding via Cross-Modal Augmentation and Spatial Relation-based Referring [49.78120051062641]
3D visual grounding aims to correlate a natural language description with the target object within a 3D scene. Existing approaches commonly encounter a shortage of text-3D pairs available for training. We propose AugRefer, a novel approach for advancing 3D visual grounding.
arXiv Detail & Related papers (2025-01-16T09:57:40Z) - Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding [50.448520056844885]
We propose a generative Bayesian network to produce diverse synthetic scenes with real-world patterns.
A series of experiments demonstrates our method's consistent superiority over existing state-of-the-art pre-training approaches.
arXiv Detail & Related papers (2024-06-17T07:43:53Z) - MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations [55.022519020409405]
This paper builds the largest multi-modal 3D scene dataset and benchmark to date with hierarchical grounded language annotations, MMScan. The resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions, as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks.
arXiv Detail & Related papers (2024-06-13T17:59:30Z) - SeCG: Semantic-Enhanced 3D Visual Grounding via Cross-modal Graph Attention [19.23636231942245]
We propose a semantic-enhanced relational learning model based on a graph network with a specially designed memory graph attention layer.
Our method replaces the original language-independent encoding with cross-modal encoding in visual analysis.
Experimental results on ReferIt3D and ScanRefer benchmarks show that the proposed method outperforms the existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-13T02:11:04Z) - 3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding [58.924180772480504]
3D visual grounding aims to localize the target object in a 3D point cloud by a free-form language description.
We propose a relation-aware one-stage framework, named 3D Relative Position-aware Network (3DRP-Net).
arXiv Detail & Related papers (2023-07-25T09:33:25Z) - Language Conditioned Spatial Relation Reasoning for 3D Object Grounding [87.03299519917019]
Localizing objects in 3D scenes based on natural language requires understanding and reasoning about spatial relations.
We propose a language-conditioned transformer model for grounding 3D objects and their spatial relations.
arXiv Detail & Related papers (2022-11-17T16:42:39Z)
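For the language-conditioned spatial-relation reasoning described in the entry above, the following is a rough, assumption-based sketch of one plausible realization: pairwise relative-position features bias the attention between objects, gated by a sentence embedding. The class name, the four-dimensional relative-position encoding, and the sigmoid gate are illustrative assumptions, not that paper's exact architecture.

```python
# Illustrative sketch of language-conditioned spatial relation reasoning
# (assumption-based; names and the relative-position encoding are generic).
import torch
import torch.nn as nn


class SpatialRelationLayer(nn.Module):
    """Self-attention over objects whose attention bias comes from pairwise
    relative positions, gated by a sentence embedding."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.rel_mlp = nn.Sequential(nn.Linear(4, dim), nn.ReLU(),
                                     nn.Linear(dim, heads))
        self.lang_gate = nn.Linear(dim, heads)

    def forward(self, obj_feats, obj_centers, lang_feat):
        # obj_feats: (B, N, dim); obj_centers: (B, N, 3); lang_feat: (B, dim)
        delta = obj_centers.unsqueeze(2) - obj_centers.unsqueeze(1)  # (B, N, N, 3)
        dist = delta.norm(dim=-1, keepdim=True)                      # (B, N, N, 1)
        rel = torch.cat([delta, dist], dim=-1)                       # (B, N, N, 4)
        bias = self.rel_mlp(rel)                                     # (B, N, N, heads)
        gate = torch.sigmoid(self.lang_gate(lang_feat))              # (B, heads)
        bias = bias * gate[:, None, None, :]                         # language-conditioned bias
        B, N, _, H = bias.shape
        # MultiheadAttention accepts a float mask of shape (B*heads, N, N)
        mask = bias.permute(0, 3, 1, 2).reshape(B * H, N, N)
        out, _ = self.attn(obj_feats, obj_feats, obj_feats, attn_mask=mask)
        return out
```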