Cross3DVG: Cross-Dataset 3D Visual Grounding on Different RGB-D Scans
- URL: http://arxiv.org/abs/2305.13876v3
- Date: Wed, 7 Feb 2024 06:10:12 GMT
- Title: Cross3DVG: Cross-Dataset 3D Visual Grounding on Different RGB-D Scans
- Authors: Taiki Miyanishi, Daichi Azuma, Shuhei Kurita, Motoki Kawanabe
- Abstract summary: We present a novel task for cross-dataset visual grounding in 3D scenes (Cross3DVG).
We created RIORefer, a large-scale 3D visual grounding dataset.
It includes more than 63k diverse descriptions of 3D objects within 1,380 indoor RGB-D scans from 3RScan.
- Score: 6.936271803454143
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a novel task for cross-dataset visual grounding in 3D scenes
(Cross3DVG), which overcomes limitations of existing 3D visual grounding
models, specifically their restricted 3D resources and consequent tendencies of
overfitting a specific 3D dataset. We created RIORefer, a large-scale 3D visual
grounding dataset, to facilitate Cross3DVG. It includes more than 63k diverse
descriptions of 3D objects within 1,380 indoor RGB-D scans from 3RScan, with
human annotations. After training the Cross3DVG model using the source 3D
visual grounding dataset, we evaluate it without target labels using the target
dataset with, e.g., different sensors, 3D reconstruction methods, and language
annotators. Comprehensive experiments are conducted using established visual
grounding models and with a CLIP-based multi-view 2D and 3D integration designed
to bridge gaps among 3D datasets. For Cross3DVG tasks, we found that (i)
cross-dataset 3D visual grounding exhibits significantly worse performance than
learning and evaluation with a single dataset because of the 3D data and
language variations across datasets. Moreover, (ii) better object detection and
localization modules, and the fusion of 3D data with multi-view CLIP-based image
features, can alleviate this lower performance. Our Cross3DVG task can provide a
benchmark for developing
robust 3D visual grounding models to handle diverse 3D scenes while leveraging
deep language understanding.
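As a concrete illustration of the protocol and the CLIP-based multi-view 2D and 3D integration described in the abstract, the sketch below shows one plausible way to fuse per-proposal 3D features with multi-view CLIP image features, score them against a CLIP text embedding, and then evaluate on a target dataset whose labels were never used for training. The module layout, feature dimensions, and the simple view-averaging scheme are assumptions for illustration, not the paper's exact architecture.

```python
# Minimal sketch of cross-dataset grounding with CLIP-based multi-view fusion.
# All names, shapes, and the fusion scheme are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiView2D3DFusion(nn.Module):
    def __init__(self, dim_3d=256, dim_clip=512, dim_out=256):
        super().__init__()
        self.proj_3d = nn.Linear(dim_3d, dim_out)
        self.proj_2d = nn.Linear(dim_clip, dim_out)
        self.proj_text = nn.Linear(dim_clip, dim_out)

    def forward(self, feat_3d, clip_views, clip_text):
        # feat_3d:    (num_boxes, dim_3d)              per-proposal 3D features
        # clip_views: (num_boxes, num_views, dim_clip) CLIP image features of each
        #             proposal observed from several views
        # clip_text:  (dim_clip,)                      CLIP embedding of the description
        f2d = self.proj_2d(clip_views).mean(dim=1)           # average over views
        fused = F.normalize(self.proj_3d(feat_3d) + f2d, dim=-1)
        text = F.normalize(self.proj_text(clip_text), dim=-1)
        return fused @ text                                   # one grounding score per box

def evaluate_cross_dataset(model, target_loader, iou_fn, thresh=0.5):
    """Score the target dataset (e.g. RIORefer after training on ScanRefer)
    without using its labels for training; ground-truth boxes are read only
    to compute accuracy at the IoU threshold."""
    hits, total = 0, 0
    with torch.no_grad():
        for feat_3d, clip_views, clip_text, boxes, gt_box in target_loader:
            scores = model(feat_3d, clip_views, clip_text)
            pred_box = boxes[scores.argmax()]
            hits += int(iou_fn(pred_box, gt_box) >= thresh)
            total += 1
    return hits / max(total, 1)
```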
Related papers
- AugRefer: Advancing 3D Visual Grounding via Cross-Modal Augmentation and Spatial Relation-based Referring [49.78120051062641]
3D visual grounding aims to correlate a natural language description with the target object within a 3D scene.
Existing approaches commonly encounter a shortage of text-3D pairs available for training.
We propose AugRefer, a novel approach for advancing 3D visual grounding.
arXiv Detail & Related papers (2025-01-16T09:57:40Z)
- V-MIND: Building Versatile Monocular Indoor 3D Detector with Diverse 2D Annotations [17.49394091283978]
V-MIND (Versatile Monocular INdoor Detector) enhances the performance of indoor 3D detectors across a diverse set of object classes.
We generate 3D training data by converting large-scale 2D images into 3D point clouds and subsequently deriving pseudo 3D bounding boxes.
V-MIND achieves state-of-the-art object detection performance across a wide range of classes on the Omni3D indoor dataset.
arXiv Detail & Related papers (2024-12-16T03:28:00Z)
- SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding [10.81711535075112]
3D Visual Grounding aims to locate objects in 3D scenes based on textual descriptions.
We introduce SeeGround, a zero-shot 3DVG framework leveraging 2D Vision-Language Models (VLMs) trained on large-scale 2D data.
We propose two modules: the Perspective Adaptation Module, which dynamically selects viewpoints for query-relevant image rendering, and the Fusion Alignment Module, which integrates 2D images with 3D spatial descriptions.
arXiv Detail & Related papers (2024-12-05T17:58:43Z)
- 3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination [22.029496025779405]
3D-GRAND is a pioneering large-scale dataset comprising 40,087 household scenes paired with 6.2 million densely-grounded scene-language instructions.
Our results show that instruction tuning with 3D-GRAND significantly enhances grounding capabilities and reduces hallucinations in 3D-LLMs.
As part of our contributions, we propose a comprehensive benchmark 3D-POPE to systematically evaluate hallucination in 3D-LLMs.
arXiv Detail & Related papers (2024-06-07T17:59:59Z)
- DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data [50.164670363633704]
We present DIRECT-3D, a diffusion-based 3D generative model for creating high-quality 3D assets from text prompts.
Our model is directly trained on extensive noisy and unaligned 'in-the-wild' 3D assets.
We achieve state-of-the-art performance in both single-class generation and text-to-3D generation.
arXiv Detail & Related papers (2024-06-06T17:58:15Z)
- 3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features [70.50665869806188]
3DiffTection is a state-of-the-art method for 3D object detection from single images.
We fine-tune a diffusion model to perform novel view synthesis conditioned on a single image.
We further train the model on target data with detection supervision.
arXiv Detail & Related papers (2023-11-07T23:46:41Z)
- 3D-LLM: Injecting the 3D World into Large Language Models [60.43823088804661]
Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning.
We propose to inject the 3D world into large language models and introduce a new family of 3D-LLMs.
Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks.
arXiv Detail & Related papers (2023-07-24T17:59:02Z)
- UniG3D: A Unified 3D Object Generation Dataset [75.49544172927749]
UniG3D is a unified 3D object generation dataset constructed by employing a universal data transformation pipeline on ShapeNet datasets.
This pipeline converts each raw 3D model into a comprehensive multi-modal data representation.
The selection of data sources for our dataset is based on their scale and quality.
arXiv Detail & Related papers (2023-06-19T07:03:45Z)
- 3D-to-2D Distillation for Indoor Scene Parsing [78.36781565047656]
We present a new approach that enables us to leverage 3D features extracted from a large-scale 3D data repository to enhance 2D features extracted from RGB images.
First, we distill 3D knowledge from a pretrained 3D network to supervise a 2D network to learn simulated 3D features from 2D features during training.
Second, we design a two-stage dimension normalization scheme to calibrate the 2D and 3D features for better integration.
Third, we design a semantic-aware adversarial training model to extend our framework for training with unpaired 3D data.
arXiv Detail & Related papers (2021-04-06T02:22:24Z)
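The 3D-to-2D distillation entry above lends itself to a small worked example. The sketch below shows one plausible form of the distillation step: a frozen, pretrained 3D network supervises "simulated 3D" features predicted by a 2D network, with a simple per-channel standardization standing in for the paper's two-stage dimension normalization. The function name, shapes, and the plain MSE objective are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of 3D-to-2D feature distillation (assumed form, not the
# paper's exact loss).
import torch.nn.functional as F

def distill_3d_to_2d(feat_2d_sim, feat_3d_teacher, eps=1e-6):
    # feat_2d_sim:     (N, C) simulated 3D features from the 2D network, gathered
    #                  at pixels that project onto the N sampled 3D points
    # feat_3d_teacher: (N, C) features from the frozen, pretrained 3D network
    # Standardize both feature spaces before comparing, so their statistics are
    # calibrated (a stand-in for the two-stage dimension normalization).
    f2d = (feat_2d_sim - feat_2d_sim.mean(0)) / (feat_2d_sim.std(0) + eps)
    f3d = (feat_3d_teacher - feat_3d_teacher.mean(0)) / (feat_3d_teacher.std(0) + eps)
    return F.mse_loss(f2d, f3d.detach())
```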
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.