MiKASA: Multi-Key-Anchor & Scene-Aware Transformer for 3D Visual Grounding
- URL: http://arxiv.org/abs/2403.03077v3
- Date: Thu, 21 Mar 2024 20:37:54 GMT
- Title: MiKASA: Multi-Key-Anchor & Scene-Aware Transformer for 3D Visual Grounding
- Authors: Chun-Peng Chang, Shaoxiang Wang, Alain Pagani, Didier Stricker
- Abstract summary: 3D visual grounding involves matching natural language descriptions with their corresponding objects in 3D spaces.
Existing methods often face challenges with accuracy in object recognition and struggle to interpret complex linguistic queries.
We present the MiKASA Transformer, which integrates a self-attention-based scene-aware object encoder and an original multi-key-anchor technique.
Our model achieves the highest overall accuracy in the Referit3D challenge for both the Sr3D and Nr3D datasets.
- Score: 12.462336116108572
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: 3D visual grounding involves matching natural language descriptions with their corresponding objects in 3D spaces. Existing methods often face challenges with accuracy in object recognition and struggle in interpreting complex linguistic queries, particularly with descriptions that involve multiple anchors or are view-dependent. In response, we present the MiKASA (Multi-Key-Anchor Scene-Aware) Transformer. Our novel end-to-end trained model integrates a self-attention-based scene-aware object encoder and an original multi-key-anchor technique, enhancing object recognition accuracy and the understanding of spatial relationships. Furthermore, MiKASA improves the explainability of decision-making, facilitating error diagnosis. Our model achieves the highest overall accuracy in the Referit3D challenge for both the Sr3D and Nr3D datasets, particularly excelling by a large margin in categories that require viewpoint-dependent descriptions.
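The abstract names two architectural ingredients: a self-attention-based scene-aware object encoder and a multi-key-anchor mechanism for spatial relationships. The paper gives the exact design; the sketch below is only a rough, hedged illustration of the general idea in PyTorch. It assumes per-object features and 3D centers have already been extracted, and every class name, dimension, and the way relative offsets enter the attention keys are assumptions, not the authors' implementation.

```python
# Hypothetical sketch (not the MiKASA code): refine per-object features with
# self-attention over the whole scene, then let each candidate object attend
# over potential anchor objects, mixing anchor features with candidate-to-anchor
# offsets so spatial relations influence the matching.
import torch
import torch.nn as nn


class SceneAwareObjectEncoder(nn.Module):
    """Self-attention over object tokens so each object sees scene context."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, obj_feats: torch.Tensor) -> torch.Tensor:
        # obj_feats: (batch, num_objects, dim)
        ctx, _ = self.attn(obj_feats, obj_feats, obj_feats)
        return self.norm(obj_feats + ctx)  # residual + norm


class MultiKeyAnchorModule(nn.Module):
    """Each candidate attends over anchors; keys add an embedding of the
    candidate-to-anchor offset to the anchor features (an assumption here)."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.offset_proj = nn.Linear(3, dim)  # embed relative xyz offsets
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, obj_feats: torch.Tensor, obj_centers: torch.Tensor) -> torch.Tensor:
        # obj_feats: (B, N, dim), obj_centers: (B, N, 3)
        B, N, D = obj_feats.shape
        # Offset from every candidate i to every potential anchor j: (B, N, N, 3)
        offsets = obj_centers[:, None, :, :] - obj_centers[:, :, None, :]
        keys = obj_feats[:, None, :, :] + self.offset_proj(offsets)  # (B, N, N, D)
        # Fold candidates into the batch dim so each candidate has its own key set.
        q = obj_feats.reshape(B * N, 1, D)
        kv = keys.reshape(B * N, N, D)
        rel, _ = self.attn(q, kv, kv)
        return self.norm(obj_feats + rel.reshape(B, N, D))


if __name__ == "__main__":
    feats = torch.randn(2, 16, 256)   # 16 object proposals per scene
    centers = torch.randn(2, 16, 3)   # their 3D centers
    x = SceneAwareObjectEncoder()(feats)
    x = MultiKeyAnchorModule()(x, centers)
    print(x.shape)                    # torch.Size([2, 16, 256])
```

Folding candidates into the batch dimension is just one simple way to give each candidate its own set of anchor keys; the actual MiKASA modules, including how language features and view-dependent descriptions are handled, may differ substantially.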
Related papers
- RefMask3D: Language-Guided Transformer for 3D Referring Segmentation [32.11635464720755]
RefMask3D explores comprehensive multi-modal feature interaction and understanding.
RefMask3D outperforms the previous state-of-the-art method by a large margin of 3.16% mIoU on the challenging ScanRefer dataset.
arXiv Detail & Related papers (2024-07-25T17:58:03Z)
- SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z)
- Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers [65.51132104404051]
We introduce the use of object identifiers and object-centric representations to interact with scenes at the object level.
Our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
arXiv Detail & Related papers (2023-12-13T14:27:45Z)
- Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding [56.00186960144545]
3D visual grounding is the task of localizing the object in a 3D scene that is referred to by a natural language description.
We propose a dense 3D grounding network, featuring four novel stand-alone modules that aim to improve grounding performance.
arXiv Detail & Related papers (2023-09-08T19:27:01Z)
- Dense Object Grounding in 3D Scenes [28.05720194887322]
Localizing objects in 3D scenes according to the semantics of a given natural language description is a fundamental and important task in the field of multimedia understanding.
We introduce 3D Dense Object Grounding (3D DOG) to jointly localize multiple objects described in a paragraph rather than a single sentence.
Our proposed 3DOGSFormer outperforms state-of-the-art 3D single-object grounding methods and their dense-object variants by significant margins.
arXiv Detail & Related papers (2023-09-05T13:27:19Z)
- Viewpoint Equivariance for Multi-View 3D Object Detection [35.4090127133834]
State-of-the-art methods focus on reasoning and decoding object bounding boxes from multi-view camera input.
We introduce VEDet, a novel 3D object detection framework that exploits 3D multi-view geometry.
arXiv Detail & Related papers (2023-03-25T19:56:41Z)
- Multiview Compressive Coding for 3D Reconstruction [77.95706553743626]
We introduce a simple framework that operates on 3D points of single objects or whole scenes.
Our model, Multiview Compressive Coding, learns to compress the input appearance and geometry to predict the 3D structure.
arXiv Detail & Related papers (2023-01-19T18:59:52Z)
- Object Scene Representation Transformer [56.40544849442227]
We introduce Object Scene Representation Transformer (OSRT), a 3D-centric model in which individual object representations naturally emerge through novel view synthesis.
OSRT scales to significantly more complex scenes with larger diversity of objects and backgrounds than existing methods.
It is multiple orders of magnitude faster at compositional rendering thanks to its light field parametrization and the novel Slot Mixer decoder.
arXiv Detail & Related papers (2022-06-14T15:40:47Z)
- LanguageRefer: Spatial-Language Model for 3D Visual Grounding [72.7618059299306]
We develop a spatial-language model for a 3D visual grounding problem.
We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
arXiv Detail & Related papers (2021-07-07T18:55:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.