Looking Outside the Box to Ground Language in 3D Scenes
- URL: http://arxiv.org/abs/2112.08879v2
- Date: Sun, 19 Dec 2021 12:15:30 GMT
- Title: Looking Outside the Box to Ground Language in 3D Scenes
- Authors: Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, Katerina
Fragkiadaki
- Abstract summary: We propose a model for grounding language in 3D scenes with three main innovations.
Iterative attention across the language stream, the point cloud feature stream and 3D box proposals.
Joint supervision from 3D object annotations and language grounding annotations.
When applied to language grounding on 2D images with minor changes, it performs on par with the state of the art while converging in half of the GPU time.
- Score: 27.126171549887232
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing language grounding models often use object proposal bottlenecks: a
pre-trained detector proposes objects in the scene and the model learns to
select the answer from these box proposals, without attending to the original
image or 3D point cloud. Object detectors are typically trained on a fixed
vocabulary of objects and attributes that is often too restrictive for
open-domain language grounding, where an utterance may refer to visual entities
at various levels of abstraction, such as a chair, the leg of a chair, or the
tip of the front leg of a chair. We propose a model for grounding language in
3D scenes that bypasses box proposal bottlenecks with three main innovations:
i) Iterative attention across the language stream, the point cloud feature
stream and 3D box proposals. ii) Transformer decoders with non-parametric
entity queries that decode 3D boxes for object and part referentials. iii)
Joint supervision from 3D object annotations and language grounding
annotations, by treating object detection as grounding of referential
utterances comprised of a list of candidate category labels. These innovations
result in significant quantitative gains (up to +9% absolute improvement on the
SR3D benchmark) over previous approaches on popular 3D language grounding
benchmarks. We ablate each of our innovations to show its contribution to the
performance of the model. When applied to language grounding on 2D images with
minor changes, it performs on par with the state of the art while converging in
half of the GPU time. The code and checkpoints will be made available at
https://github.com/nickgkan/beauty_detr
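
The abstract describes the architecture only at a high level; the released implementation lives at https://github.com/nickgkan/beauty_detr. Below is a minimal, hedged sketch (not the authors' code) of two of the stated innovations: entity queries that iteratively cross-attend to the language stream and the point cloud feature stream before predicting 3D boxes, and detection data recast as grounding of an utterance built from candidate category labels. Every module name, dimension, and the query-seeding strategy is an illustrative assumption.

```python
# Illustrative sketch only; module names, sizes, and the single decoder layer
# are assumptions, not the authors' released implementation.
import torch
import torch.nn as nn


def detection_as_grounding_utterance(category_labels):
    """Innovation (iii): turn a detector's label vocabulary into a referential
    utterance, so 3D detection annotations can supervise the grounding model."""
    return ". ".join(category_labels) + "."


class EntityDecoderLayer(nn.Module):
    """One round of the iterative attention in innovations (i)/(ii): entity
    queries attend to word features, then to point cloud features, then to
    each other, and finally predict one 3D box per query."""

    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.attend_words = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attend_points = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_attend = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                 nn.Linear(d_model, d_model))
        self.box_head = nn.Linear(d_model, 6)  # (center_xyz, size_whd) per query

    def forward(self, queries, word_feats, point_feats):
        queries = queries + self.attend_words(queries, word_feats, word_feats)[0]
        queries = queries + self.attend_points(queries, point_feats, point_feats)[0]
        queries = queries + self.self_attend(queries, queries, queries)[0]
        queries = queries + self.ffn(queries)
        return queries, self.box_head(queries)


# Usage sketch: "non-parametric" queries are seeded from sampled point features
# rather than learned embeddings (an assumption about the mechanism).
d_model, n_queries = 256, 64
word_feats = torch.randn(1, 20, d_model)     # encoded utterance tokens
point_feats = torch.randn(1, 1024, d_model)  # encoded point cloud tokens
queries = point_feats[:, torch.randperm(1024)[:n_queries]]

layer = EntityDecoderLayer(d_model)
queries, boxes = layer(queries, word_feats, point_feats)
print(boxes.shape)  # torch.Size([1, 64, 6])

print(detection_as_grounding_utterance(["chair", "table", "door"]))
# -> "chair. table. door."
```

In the paper's framing, the same grounding loss used for referential utterances can then supervise plain detection examples, which is what the prompt-construction helper above is meant to illustrate.
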
Related papers
- Grounded 3D-LLM with Referent Tokens [58.890058568493096]
We propose Grounded 3D-LLM to consolidate various 3D vision tasks within a unified generative framework.
The model uses scene referent tokens as special noun phrases to reference 3D scenes.
It offers a natural approach for translating 3D vision tasks into language formats using task-specific instruction templates.
arXiv Detail & Related papers (2024-05-16T18:03:41Z)
- Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers [65.51132104404051]
We introduce the use of object identifiers and object-centric representations to interact with scenes at the object level.
Our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
arXiv Detail & Related papers (2023-12-13T14:27:45Z)
- Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding [56.00186960144545]
3D visual grounding is the task of localizing, in a 3D scene, the object referred to by a natural-language description.
We propose a dense 3D grounding network, featuring four novel stand-alone modules that aim to improve grounding performance.
arXiv Detail & Related papers (2023-09-08T19:27:01Z)
- Dense Object Grounding in 3D Scenes [28.05720194887322]
Localizing objects in 3D scenes according to the semantics of a given natural-language description is a fundamental and important task in multimedia understanding.
We introduce 3D Dense Object Grounding (3D DOG) to jointly localize multiple objects described in a complex paragraph rather than a single sentence.
Our proposed 3DOGSFormer outperforms state-of-the-art 3D single-object grounding methods and their dense-object variants by significant margins.
arXiv Detail & Related papers (2023-09-05T13:27:19Z)
- Language Conditioned Spatial Relation Reasoning for 3D Object Grounding [87.03299519917019]
Localizing objects in 3D scenes based on natural language requires understanding and reasoning about spatial relations.
We propose a language-conditioned transformer model for grounding 3D objects and their spatial relations.
arXiv Detail & Related papers (2022-11-17T16:42:39Z)
- CMR3D: Contextualized Multi-Stage Refinement for 3D Object Detection [57.44434974289945]
We propose the Contextualized Multi-Stage Refinement for 3D Object Detection (CMR3D) framework.
Our framework takes a 3D scene as input and strives to explicitly integrate useful contextual information of the scene.
In addition to 3D object detection, we investigate the effectiveness of our framework for the problem of 3D object counting.
arXiv Detail & Related papers (2022-09-13T05:26:09Z)
- LanguageRefer: Spatial-Language Model for 3D Visual Grounding [72.7618059299306]
We develop a spatial-language model for the 3D visual grounding problem.
We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
arXiv Detail & Related papers (2021-07-07T18:55:03Z)