Multi3DRefer: Grounding Text Description to Multiple 3D Objects
- URL: http://arxiv.org/abs/2309.05251v1
- Date: Mon, 11 Sep 2023 06:03:39 GMT
- Title: Multi3DRefer: Grounding Text Description to Multiple 3D Objects
- Authors: Yiming Zhang, ZeMing Gong, Angel X. Chang
- Abstract summary: We introduce the task of localizing a flexible number of objects in real-world 3D scenes using natural language descriptions.
Our dataset contains 61926 descriptions of 11609 objects, where zero, single or multiple target objects are referenced by each description.
We develop a better baseline leveraging 2D features from CLIP by rendering proposals online with contrastive learning, which outperforms the state of the art on the ScanRefer benchmark.
- Score: 15.54885309441946
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce the task of localizing a flexible number of objects in
real-world 3D scenes using natural language descriptions. Existing 3D visual
grounding tasks focus on localizing a unique object given a text description.
However, such a strict setting is unnatural as localizing potentially multiple
objects is a common need in real-world scenarios and robotic tasks (e.g.,
visual navigation and object rearrangement). To address this setting we propose
Multi3DRefer, generalizing the ScanRefer dataset and task. Our dataset contains
61926 descriptions of 11609 objects, where zero, single or multiple target
objects are referenced by each description. We also introduce a new evaluation
metric and benchmark methods from prior work to enable further investigation of
multi-modal 3D scene understanding. Furthermore, we develop a better baseline
leveraging 2D features from CLIP by rendering object proposals online with
contrastive learning, which outperforms the state of the art on the ScanRefer
benchmark.
Related papers
- MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations [55.022519020409405]
This paper builds the first largest ever multi-modal 3D scene dataset and benchmark with hierarchical grounded language annotations, MMScan.
The resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks.
arXiv Detail & Related papers (2024-06-13T17:59:30Z) - Grounded 3D-LLM with Referent Tokens [58.890058568493096]
We propose Grounded 3D-LLM to consolidate various 3D vision tasks within a unified generative framework.
The model uses scene referent tokens as special noun phrases to reference 3D scenes.
Per-task instruction-following templates are employed to ensure natural and diversity in translating 3D vision tasks into language formats.
arXiv Detail & Related papers (2024-05-16T18:03:41Z) - Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers [65.51132104404051]
We introduce the use of object identifiers and object-centric representations to interact with scenes at the object level.
Our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
arXiv Detail & Related papers (2023-12-13T14:27:45Z) - Dense Object Grounding in 3D Scenes [28.05720194887322]
Localizing objects in 3D scenes according to the semantics of a given natural language is a fundamental yet important task in the field of multimedia understanding.
We introduce 3D Dense Object Grounding (3D DOG), to jointly localize multiple objects described in a more complicated paragraph rather than a single sentence.
Our proposed 3DOGSFormer outperforms state-of-the-art 3D single-object grounding methods and their dense-object variants by significant margins.
arXiv Detail & Related papers (2023-09-05T13:27:19Z) - Anything-3D: Towards Single-view Anything Reconstruction in the Wild [61.090129285205805]
We introduce Anything-3D, a methodical framework that ingeniously combines a series of visual-language models and the Segment-Anything object segmentation model.
Our approach employs a BLIP model to generate textural descriptions, utilize the Segment-Anything model for the effective extraction of objects of interest, and leverages a text-to-image diffusion model to lift object into a neural radiance field.
arXiv Detail & Related papers (2023-04-19T16:39:51Z) - WildRefer: 3D Object Localization in Large-scale Dynamic Scenes with Multi-modal Visual Data and Natural Language [31.691159120136064]
We introduce the task of 3D visual grounding in large-scale dynamic scenes based on natural linguistic descriptions and online captured multi-modal visual data.
We present a novel method, dubbed WildRefer, for this task by fully utilizing the rich appearance information in images, the position and geometric clues in point cloud.
Our datasets are significant for the research of 3D visual grounding in the wild and has huge potential to boost the development of autonomous driving and service robots.
arXiv Detail & Related papers (2023-04-12T06:48:26Z) - CMR3D: Contextualized Multi-Stage Refinement for 3D Object Detection [57.44434974289945]
We propose Contextualized Multi-Stage Refinement for 3D Object Detection (CMR3D) framework.
Our framework takes a 3D scene as input and strives to explicitly integrate useful contextual information of the scene.
In addition to 3D object detection, we investigate the effectiveness of our framework for the problem of 3D object counting.
arXiv Detail & Related papers (2022-09-13T05:26:09Z) - LanguageRefer: Spatial-Language Model for 3D Visual Grounding [72.7618059299306]
We develop a spatial-language model for a 3D visual grounding problem.
We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
arXiv Detail & Related papers (2021-07-07T18:55:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.