3EED: Ground Everything Everywhere in 3D
- URL: http://arxiv.org/abs/2511.01755v1
- Date: Mon, 03 Nov 2025 17:05:22 GMT
- Title: 3EED: Ground Everything Everywhere in 3D
- Authors: Rong Li, Yuhao Dong, Tianshuai Hu, Ao Liang, Youquan Liu, Dongyue Lu, Liang Pan, Lingdong Kong, Junwei Liang, Ziwei Liu
- Abstract summary: 3EED is a multi-platform, multi-modal 3D grounding benchmark featuring RGB and LiDAR data from vehicle, drone, and quadruped platforms. We provide over 128,000 objects and 22,000 validated referring expressions across diverse outdoor scenes -- 10x larger than existing datasets.
- Score: 64.28938045702077
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Visual grounding in 3D is the key for embodied agents to localize language-referred objects in open-world environments. However, existing benchmarks are limited by their indoor focus, single-platform constraints, and small scale. We introduce 3EED, a multi-platform, multi-modal 3D grounding benchmark featuring RGB and LiDAR data from vehicle, drone, and quadruped platforms. We provide over 128,000 objects and 22,000 validated referring expressions across diverse outdoor scenes -- 10x larger than existing datasets. We develop a scalable annotation pipeline combining vision-language model prompting with human verification to ensure high-quality spatial grounding. To support cross-platform learning, we propose platform-aware normalization and cross-modal alignment techniques, and establish benchmark protocols for in-domain and cross-platform evaluations. Our findings reveal significant performance gaps, highlighting the challenges and opportunities of generalizable 3D grounding. The 3EED dataset and benchmark toolkit are released to advance future research in language-driven 3D embodied perception.
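The abstract mentions platform-aware normalization for cross-platform learning but does not spell out the procedure. Below is a minimal sketch of one plausible reading, in which each platform's LiDAR scan is mapped into a shared ego-aligned frame; the function name, the per-platform sensor heights, and the normalization recipe are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np

# Assumed per-platform sensor heights above ground (meters); the actual
# values and normalization used by 3EED are not given in the abstract.
PLATFORM_SENSOR_HEIGHT = {"vehicle": 1.8, "drone": 30.0, "quadruped": 0.5}

def platform_aware_normalize(points: np.ndarray, platform: str) -> np.ndarray:
    """Map an (N, 3) LiDAR point cloud into a shared ego-aligned frame.

    Simplified, assumed recipe:
      1. shift points so the ground plane sits near z = 0 by compensating
         the platform-specific sensor height;
      2. re-center x/y around the scan's horizontal mean;
      3. divide by a robust radius so spatial extents from different
         platforms (e.g. drone vs. quadruped) become comparable.
    """
    pts = points.astype(np.float64)
    pts[:, 2] += PLATFORM_SENSOR_HEIGHT[platform]   # ground plane -> z ~= 0
    pts[:, :2] -= pts[:, :2].mean(axis=0)           # horizontal re-centering
    radius = np.percentile(np.linalg.norm(pts[:, :2], axis=1), 95)
    return pts / max(radius, 1e-6)                  # scale normalization

# Toy usage: scans with very different extents land on a comparable scale.
drone_scan = np.random.randn(2048, 3) * np.array([40.0, 40.0, 3.0])
quadruped_scan = np.random.randn(2048, 3) * np.array([8.0, 8.0, 1.0])
print(platform_aware_normalize(drone_scan, "drone")[:, :2].std())
print(platform_aware_normalize(quadruped_scan, "quadruped")[:, :2].std())
```

A cross-modal alignment step (pairing RGB features with the normalized LiDAR points) would sit on top of such a canonical frame, but the abstract gives no details, so it is not sketched here.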
Related papers
- Perspective-Invariant 3D Object Detection
We introduce Pi3DET, the first benchmark featuring LiDAR data and 3D bounding box annotations collected from multiple platforms. We propose a novel cross-platform adaptation framework that transfers knowledge from the well-studied vehicle platform to other platforms. We establish a benchmark to evaluate the resilience and robustness of current 3D detectors in cross-platform scenarios.
arXiv Detail & Related papers (2025-07-23T16:29:57Z)
- SURPRISE3D: A Dataset for Spatial Understanding and Reasoning in Complex 3D Scenes
We introduce Surprise3D, a novel dataset designed to evaluate language-guided spatial reasoning segmentation in complex 3D scenes. Surprise3D consists of more than 200k vision-language pairs across 900+ detailed indoor scenes from ScanNet++ v2. The dataset contains 89k+ human-annotated spatial queries deliberately crafted without object names.
arXiv Detail & Related papers (2025-07-10T14:01:24Z)
- Zero-Shot 3D Visual Grounding from Vision-Language Models
3D Visual Grounding (3DVG) seeks to locate target objects in 3D scenes using natural language descriptions. We present SeeGround, a zero-shot 3DVG framework that leverages 2D Vision-Language Models (VLMs) to bypass the need for 3D-specific training.
arXiv Detail & Related papers (2025-05-28T14:53:53Z)
- IRef-VLA: A Benchmark for Interactive Referential Grounding with Imperfect Language in 3D Scenes
IRef-VLA is the largest real-world dataset for the referential grounding task, consisting of over 11.5K scanned 3D rooms. We aim to provide a resource for 3D scene understanding that aids the development of robust, interactive navigation systems.
arXiv Detail & Related papers (2025-03-20T16:16:10Z)
- AugRefer: Advancing 3D Visual Grounding via Cross-Modal Augmentation and Spatial Relation-based Referring
3D visual grounding aims to correlate a natural language description with the target object within a 3D scene. Existing approaches commonly encounter a shortage of text-3D pairs available for training. We propose AugRefer, a novel approach for advancing 3D visual grounding.
arXiv Detail & Related papers (2025-01-16T09:57:40Z)
- MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations
This paper builds MMScan, the largest multi-modal 3D scene dataset and benchmark to date with hierarchical grounded language annotations. The resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions, as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks.
arXiv Detail & Related papers (2024-06-13T17:59:30Z)
- Grounded 3D-LLM with Referent Tokens
We propose Grounded 3D-LLM to consolidate various 3D vision tasks within a unified generative framework.
The model uses scene referent tokens as special noun phrases to reference 3D scenes.
Per-task instruction-following templates are employed to ensure naturalness and diversity in translating 3D vision tasks into language formats.
arXiv Detail & Related papers (2024-05-16T18:03:41Z)
- Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding
3D visual grounding is the task of localizing the object in a 3D scene that is referred to by a natural language description.
We propose a dense 3D grounding network, featuring four novel stand-alone modules that aim to improve grounding performance.
arXiv Detail & Related papers (2023-09-08T19:27:01Z)
- Cross3DVG: Cross-Dataset 3D Visual Grounding on Different RGB-D Scans
We present Cross3DVG, a novel task for cross-dataset visual grounding in 3D scenes.
We created RIORefer, a large-scale 3D visual grounding dataset.
It includes more than 63k diverse descriptions of 3D objects within 1,380 indoor RGB-D scans from 3RScan.
arXiv Detail & Related papers (2023-05-23T09:52:49Z)