Cross3DVG: Cross-Dataset 3D Visual Grounding on Different RGB-D Scans
- URL: http://arxiv.org/abs/2305.13876v3
- Date: Wed, 7 Feb 2024 06:10:12 GMT
- Title: Cross3DVG: Cross-Dataset 3D Visual Grounding on Different RGB-D Scans
- Authors: Taiki Miyanishi, Daichi Azuma, Shuhei Kurita, Motoki Kawanabe
- Abstract summary: We present a novel task for cross-dataset visual grounding in 3D scenes (Cross3DVG).
We created RIORefer, a large-scale 3D visual grounding dataset.
It includes more than 63k diverse descriptions of 3D objects within 1,380 indoor RGB-D scans from 3RScan.
- Score: 6.936271803454143
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a novel task for cross-dataset visual grounding in 3D scenes
(Cross3DVG), which overcomes limitations of existing 3D visual grounding
models, specifically their restricted 3D resources and the consequent tendency
to overfit to a specific 3D dataset. We created RIORefer, a large-scale 3D visual
grounding dataset, to facilitate Cross3DVG. It includes more than 63k diverse
descriptions of 3D objects within 1,380 indoor RGB-D scans from 3RScan, with
human annotations. After training the Cross3DVG model using the source 3D
visual grounding dataset, we evaluate it without target labels using the target
dataset with, e.g., different sensors, 3D reconstruction methods, and language
annotators. Comprehensive experiments are conducted using established visual
grounding models and a CLIP-based multi-view 2D and 3D integration method
designed to bridge the gaps among 3D datasets. These experiments show that (i)
cross-dataset 3D visual grounding performs significantly worse than training and
evaluating on a single dataset because of variations in 3D data and language
across datasets, and that (ii) better object detection and localization modules,
as well as fusing 3D data with multi-view CLIP-based image features, can
alleviate this performance drop. Our Cross3DVG task provides a benchmark for developing
robust 3D visual grounding models to handle diverse 3D scenes while leveraging
deep language understanding.
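The abstract mentions a CLIP-based multi-view 2D and 3D integration for bridging gaps among 3D datasets. The sketch below is a minimal, hypothetical illustration of that idea (not the authors' implementation): per-object 3D features are fused with CLIP image features averaged over multiple view crops. The module name, feature dimensions, and mean-pooling over views are assumptions.

```python
# Minimal sketch (assumed design, not the paper's code): fuse per-object 3D
# features with multi-view CLIP image features before the grounding head.
import torch
import torch.nn as nn

class MultiViewCLIPFusion(nn.Module):
    def __init__(self, d3d: int = 256, dclip: int = 512, dout: int = 256):
        super().__init__()
        # Project precomputed CLIP features into the 3D feature space.
        self.proj_clip = nn.Linear(dclip, d3d)
        self.fuse = nn.Sequential(
            nn.Linear(2 * d3d, dout), nn.ReLU(), nn.Linear(dout, dout)
        )

    def forward(self, feat3d: torch.Tensor, feat_clip_views: torch.Tensor) -> torch.Tensor:
        # feat3d:          (num_objects, d3d) per-object point cloud features
        # feat_clip_views: (num_objects, num_views, dclip) CLIP features of view crops
        clip_agg = self.proj_clip(feat_clip_views).mean(dim=1)  # average over views
        return self.fuse(torch.cat([feat3d, clip_agg], dim=-1))  # fused object features

# Toy usage: 32 candidate objects, 3 views each, random features as stand-ins.
fusion = MultiViewCLIPFusion()
fused = fusion(torch.randn(32, 256), torch.randn(32, 3, 512))
print(fused.shape)  # torch.Size([32, 256])
```

In the cross-dataset setting described above, such a model would be trained on the source dataset only and then evaluated, without target labels, on the target dataset (e.g., RIORefer built from 3RScan).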
Related papers
- E3D-GPT: Enhanced 3D Visual Foundation for Medical Vision-Language Model [23.56751925900571]
The development of 3D medical vision-language models holds significant potential for disease diagnosis and patient treatment.
We utilize self-supervised learning to construct a 3D visual foundation model for extracting 3D visual features.
We apply 3D spatial convolutions to aggregate and project high-level image features, reducing computational complexity.
Our model demonstrates superior performance compared to existing methods in report generation, visual question answering, and disease diagnosis.
arXiv Detail & Related papers (2024-10-18T06:31:40Z)
- 3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination [22.029496025779405]
3D-GRAND is a pioneering large-scale dataset comprising 40,087 household scenes paired with 6.2 million densely-grounded scene-language instructions.
Our results show that instruction tuning with 3D-GRAND significantly enhances grounding capabilities and reduces hallucinations in 3D-LLMs.
As part of our contributions, we propose a comprehensive benchmark 3D-POPE to systematically evaluate hallucination in 3D-LLMs.
arXiv Detail & Related papers (2024-06-07T17:59:59Z)
- DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data [50.164670363633704]
We present DIRECT-3D, a diffusion-based 3D generative model for creating high-quality 3D assets from text prompts.
Our model is directly trained on extensive noisy and unaligned 'in-the-wild' 3D assets.
We achieve state-of-the-art performance in both single-class generation and text-to-3D generation.
arXiv Detail & Related papers (2024-06-06T17:58:15Z)
- 3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features [70.50665869806188]
3DiffTection is a state-of-the-art method for 3D object detection from single images.
We fine-tune a diffusion model to perform novel view synthesis conditioned on a single image.
We further train the model on target data with detection supervision.
arXiv Detail & Related papers (2023-11-07T23:46:41Z)
- Uni3D: Exploring Unified 3D Representation at Scale [66.26710717073372]
We present Uni3D, a 3D foundation model to explore the unified 3D representation at scale.
Uni3D uses a 2D ViT, pretrained end-to-end, to align 3D point cloud features with image-text aligned features.
We show that the strong Uni3D representation also enables applications such as 3D painting and retrieval in the wild.
arXiv Detail & Related papers (2023-10-10T16:49:21Z)
- 3D-LLM: Injecting the 3D World into Large Language Models [60.43823088804661]
Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning.
We propose to inject the 3D world into large language models and introduce a new family of 3D-LLMs.
Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks.
arXiv Detail & Related papers (2023-07-24T17:59:02Z)
- UniG3D: A Unified 3D Object Generation Dataset [75.49544172927749]
UniG3D is a unified 3D object generation dataset constructed by employing a universal data transformation pipeline on ShapeNet datasets.
This pipeline converts each raw 3D model into comprehensive multi-modal data representation.
The selection of data sources for our dataset is based on their scale and quality.
arXiv Detail & Related papers (2023-06-19T07:03:45Z)
- 3D-to-2D Distillation for Indoor Scene Parsing [78.36781565047656]
We present a new approach that enables us to leverage 3D features extracted from large-scale 3D data repository to enhance 2D features extracted from RGB images.
First, we distill 3D knowledge from a pretrained 3D network to supervise a 2D network to learn simulated 3D features from 2D features during the training.
Second, we design a two-stage dimension normalization scheme to calibrate the 2D and 3D features for better integration.
Third, we design a semantic-aware adversarial training model to extend our framework for training with unpaired 3D data.
arXiv Detail & Related papers (2021-04-06T02:22:24Z)
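The 3D-to-2D distillation entry above lists three components; as a rough, hypothetical sketch of the first one only (not the paper's code), a frozen pretrained 3D network can supervise a 2D network to produce "simulated" 3D features through a simple feature-matching loss. The module names, feature dimensions, and the omitted 3D-to-image projection step are assumptions.

```python
# Minimal sketch (assumed design): a 2D branch learns "simulated" 3D features
# that match per-pixel features provided by a frozen, pretrained 3D network.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Simulated3DHead(nn.Module):
    """Maps 2D backbone features to the teacher's 3D feature dimension."""
    def __init__(self, d2d: int = 64, d3d: int = 96):
        super().__init__()
        self.project = nn.Conv2d(d2d, d3d, kernel_size=1)

    def forward(self, feat2d: torch.Tensor) -> torch.Tensor:
        # feat2d: (B, d2d, H, W) features from any 2D backbone
        return self.project(feat2d)  # (B, d3d, H, W) simulated 3D features

def distillation_loss(sim3d: torch.Tensor, teacher3d: torch.Tensor) -> torch.Tensor:
    # teacher3d: teacher features projected onto the image plane by the frozen
    # pretrained 3D network; the point-to-pixel projection is omitted here.
    return F.mse_loss(sim3d, teacher3d)

# Toy usage with random tensors standing in for real features.
head = Simulated3DHead()
loss = distillation_loss(head(torch.randn(2, 64, 30, 40)), torch.randn(2, 96, 30, 40))
loss.backward()
```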