LanguageRefer: Spatial-Language Model for 3D Visual Grounding
- URL: http://arxiv.org/abs/2107.03438v1
- Date: Wed, 7 Jul 2021 18:55:03 GMT
- Title: LanguageRefer: Spatial-Language Model for 3D Visual Grounding
- Authors: Junha Roh, Karthik Desingh, Ali Farhadi, Dieter Fox
- Abstract summary: We develop a spatial-language model for a 3D visual grounding problem.
We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: To realize robots that can understand human instructions and perform meaningful tasks in the near future, it is important to develop learned models that can understand referential language to identify common objects in real-world 3D scenes. In this paper, we develop a spatial-language model for a 3D visual grounding problem. Specifically, given a reconstructed 3D scene in the form of a point cloud with 3D bounding boxes of potential object candidates, and a language utterance referring to a target object in the scene, our model identifies the target object from the set of candidates. Our spatial-language model uses a transformer-based architecture that combines spatial embeddings derived from the bounding boxes with fine-tuned language embeddings from DistilBERT, and reasons among the objects in the 3D scene to find the target object. We show that our model performs competitively on the visio-linguistic datasets proposed by ReferIt3D. We provide additional analysis of performance on spatial reasoning tasks decoupled from perception noise, the effect of view-dependent utterances on accuracy, and viewpoint annotations for potential robotics applications.
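To make the described architecture concrete, below is a minimal sketch, based on our own assumptions rather than the authors' released code, of how such a spatial-language model could be wired up: bounding-box parameters of each candidate object are projected into the same embedding space as the DistilBERT token embeddings of the utterance, a transformer encoder reasons jointly over language tokens and object tokens, and a per-object head scores each candidate as the referred target. Module names, dimensions, and the box parameterization are illustrative assumptions.

```python
# Illustrative sketch only -- not the authors' implementation.
# Assumes each candidate object is summarized by a bounding-box vector
# (center xyz + size whd), the utterance is encoded with DistilBERT, and
# a transformer encoder attends jointly over language and object tokens.
import torch
import torch.nn as nn
from transformers import DistilBertModel, DistilBertTokenizerFast


class SpatialLanguageGrounder(nn.Module):
    def __init__(self, box_dim: int = 6, d_model: int = 768, n_layers: int = 4):
        super().__init__()
        self.text_encoder = DistilBertModel.from_pretrained("distilbert-base-uncased")
        # Project raw bounding-box parameters into the language embedding space.
        self.box_proj = nn.Linear(box_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=8, batch_first=True
        )
        self.reasoner = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # One logit per candidate object: "is this the referred target?"
        self.target_head = nn.Linear(d_model, 1)

    def forward(self, input_ids, attention_mask, boxes):
        # boxes: (batch, num_objects, box_dim)
        text = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                      # (batch, num_tokens, d_model)
        objs = self.box_proj(boxes)              # (batch, num_objects, d_model)
        fused = torch.cat([text, objs], dim=1)   # joint language + object sequence
        fused = self.reasoner(fused)
        obj_states = fused[:, text.size(1):]     # keep only the object positions
        return self.target_head(obj_states).squeeze(-1)  # (batch, num_objects)


tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
enc = tokenizer(["the chair closest to the window"], return_tensors="pt")
boxes = torch.randn(1, 5, 6)  # 5 candidate objects with (x, y, z, w, h, d)
model = SpatialLanguageGrounder()
scores = model(enc["input_ids"], enc["attention_mask"], boxes)
predicted = scores.argmax(dim=-1)  # index of the predicted target object
```

In a setup like this, the per-object scores would typically be trained with a cross-entropy loss over the candidate set, and the highest-scoring box would be selected as the target at inference time.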
Related papers
- SUGAR: Pre-training 3D Visual Representations for Robotics (2024-04-01)
We introduce SUGAR, a novel 3D pre-training framework for robotics.
SUGAR captures semantic, geometric, and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
- Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers (2023-12-13)
We introduce the use of object identifiers and object-centric representations to interact with scenes at the object level.
Our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
- Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding (2023-09-08)
3D visual grounding is the task of localizing the object in a 3D scene that is referred to by a natural-language description.
We propose a dense 3D grounding network featuring four novel stand-alone modules that aim to improve grounding performance.
- Paparazzi: A Deep Dive into the Capabilities of Language and Vision Models for Grounding Viewpoint Descriptions (2023-02-13)
We investigate whether a state-of-the-art language and vision model, CLIP, is able to ground perspective descriptions of a 3D object.
We present an evaluation framework that uses a camera circling a 3D object to generate images from different viewpoints.
We find that a pre-trained CLIP model performs poorly on most canonical views.
- Language Conditioned Spatial Relation Reasoning for 3D Object Grounding (2022-11-17)
Localizing objects in 3D scenes based on natural language requires understanding and reasoning about spatial relations.
We propose a language-conditioned transformer model for grounding 3D objects and their spatial relations.
- Looking Outside the Box to Ground Language in 3D Scenes (2021-12-16)
We propose a model for grounding language in 3D scenes with three main innovations, including iterative attention across the language stream, the point-cloud feature stream, and 3D box proposals, as well as joint supervision from 3D object annotations and language grounding annotations.
When applied to language grounding on 2D images with minor changes, it performs on par with the state of the art while converging in half the GPU time.
- Language Grounding with 3D Objects (2021-07-26)
We introduce a novel reasoning task that targets both visual and non-visual language about 3D objects.
We introduce several CLIP-based models for distinguishing objects.
We find that adding view estimation to language grounding models improves accuracy both on SNARE and when identifying objects referred to in language on a robot platform.