Related papers: ViGiL3D: A Linguistically Diverse Dataset for 3D Visual Grounding

ViGiL3D: A Linguistically Diverse Dataset for 3D Visual Grounding

URL: http://arxiv.org/abs/2501.01366v1
Date: Thu, 02 Jan 2025 17:20:41 GMT
Title: ViGiL3D: A Linguistically Diverse Dataset for 3D Visual Grounding
Authors: Austin T. Wang, ZeMing Gong, Angel X. Chang,
Abstract summary: 3D visual grounding involves localizing entities in a 3D scene referred to by natural language text.<n>We introduce Visual Grounding with Diverse Language in 3D (ViGiL3D), a diagnostic dataset for evaluating visual grounding methods against a diverse set of language patterns.
Score: 9.289977174410824
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: 3D visual grounding (3DVG) involves localizing entities in a 3D scene referred to by natural language text. Such models are useful for embodied AI and scene retrieval applications, which involve searching for objects or patterns using natural language descriptions. While recent works have focused on LLM-based scaling of 3DVG datasets, these datasets do not capture the full range of potential prompts which could be specified in the English language. To ensure that we are scaling up and testing against a useful and representative set of prompts, we propose a framework for linguistically analyzing 3DVG prompts and introduce Visual Grounding with Diverse Language in 3D (ViGiL3D), a diagnostic dataset for evaluating visual grounding methods against a diverse set of language patterns. We evaluate existing open-vocabulary 3DVG methods to demonstrate that these methods are not yet proficient in understanding and identifying the targets of more challenging, out-of-distribution prompts, toward real-world applications.

Related papers

A Review of 3D Object Detection with Vision-Language Models [0.31457219084519]
We provide the first systematic analysis dedicated to 3D object detection with vision-language models. Traditional approaches using point clouds and voxel grids are compared to modern vision-language frameworks like CLIP and 3D LLMs. We highlight current challenges, such as limited 3D-language datasets and computational demands.
arXiv Detail & Related papers (2025-04-25T23:27:26Z)
IRef-VLA: A Benchmark for Interactive Referential Grounding with Imperfect Language in 3D Scenes [10.139461308573336]
IRef-VLA is the largest real-world dataset for the referential grounding task consisting of over 11.5K scanned 3D rooms. We aim to provide a resource for 3D scene understanding that aids the development of robust, interactive navigation systems.
arXiv Detail & Related papers (2025-03-20T16:16:10Z)
AugRefer: Advancing 3D Visual Grounding via Cross-Modal Augmentation and Spatial Relation-based Referring [49.78120051062641]
3D visual grounding aims to correlate a natural language description with the target object within a 3D scene. Existing approaches commonly encounter a shortage of text3D pairs available for training. We propose AugRefer, a novel approach for advancing 3D visual grounding.
arXiv Detail & Related papers (2025-01-16T09:57:40Z)
g3D-LF: Generalizable 3D-Language Feature Fields for Embodied Tasks [62.74304008688472]
Generalizable 3D-Language Feature Fields (g3D-LF) is a 3D representation model pre-trained on large-scale 3D-language dataset for embodied tasks.
arXiv Detail & Related papers (2024-11-26T01:54:52Z)
Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image [70.02187124865627]
Open-vocabulary 3D object detection (OV-3DDet) aims to localize and recognize both seen and previously unseen object categories within any new 3D scene. We leverage a vision foundation model to provide image-wise guidance for discovering novel classes in 3D scenes. We demonstrate significant improvements in accuracy and generalization, highlighting the potential of foundation models in advancing open-vocabulary 3D object detection.
arXiv Detail & Related papers (2024-07-07T04:50:04Z)
Grounded 3D-LLM with Referent Tokens [58.890058568493096]
We propose Grounded 3D-LLM to consolidate various 3D vision tasks within a unified generative framework. The model uses scene referent tokens as special noun phrases to reference 3D scenes. Per-task instruction-following templates are employed to ensure natural and diversity in translating 3D vision tasks into language formats.
arXiv Detail & Related papers (2024-05-16T18:03:41Z)
Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding [57.64806066986975]
3D Visual Grounding aims at localizing 3D object based on textual descriptions. We propose a novel visual programming approach for zero-shot open-vocabulary 3DVG.
arXiv Detail & Related papers (2023-11-26T19:01:14Z)
WildRefer: 3D Object Localization in Large-scale Dynamic Scenes with Multi-modal Visual Data and Natural Language [31.691159120136064]
We introduce the task of 3D visual grounding in large-scale dynamic scenes based on natural linguistic descriptions and online captured multi-modal visual data. We present a novel method, dubbed WildRefer, for this task by fully utilizing the rich appearance information in images, the position and geometric clues in point cloud. Our datasets are significant for the research of 3D visual grounding in the wild and has huge potential to boost the development of autonomous driving and service robots.
arXiv Detail & Related papers (2023-04-12T06:48:26Z)
LanguageRefer: Spatial-Language Model for 3D Visual Grounding [72.7618059299306]
We develop a spatial-language model for a 3D visual grounding problem. We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
arXiv Detail & Related papers (2021-07-07T18:55:03Z)

This list is automatically generated from the titles and abstracts of the papers in this site.