Paparazzi: A Deep Dive into the Capabilities of Language and Vision
Models for Grounding Viewpoint Descriptions
- URL: http://arxiv.org/abs/2302.10282v1
- Date: Mon, 13 Feb 2023 15:18:27 GMT
- Title: Paparazzi: A Deep Dive into the Capabilities of Language and Vision
Models for Grounding Viewpoint Descriptions
- Authors: Henrik Voigt, Jan Hombeck, Monique Meuschke, Kai Lawonn, Sina Zarrieß
- Abstract summary: We investigate whether a state-of-the-art language and vision model, CLIP, is able to ground perspective descriptions of a 3D object.
We present an evaluation framework that uses a circling camera around a 3D object to generate images from different viewpoints.
We find that a pre-trained CLIP model performs poorly on most canonical views.
- Score: 4.026600887656479
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing language and vision models achieve impressive performance in
image-text understanding. Yet, it is an open question to what extent they can
be used for language understanding in 3D environments and whether they
implicitly acquire 3D object knowledge, e.g. about different views of an
object. In this paper, we investigate whether a state-of-the-art language and
vision model, CLIP, is able to ground perspective descriptions of a 3D object
and identify canonical views of common objects based on text queries. We
present an evaluation framework that uses a circling camera around a 3D object
to generate images from different viewpoints and evaluate them in terms of
their similarity to natural language descriptions. We find that a pre-trained
CLIP model performs poorly on most canonical views and that fine-tuning using
hard negative sampling and random contrasting yields good results even under
conditions with little available training data.
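As a rough, non-authoritative illustration of the evaluation setup described in the abstract, the Python sketch below scores pre-rendered views of a single object against a natural-language viewpoint description using a pre-trained CLIP model (OpenAI's clip package). The renderer, file naming, 10-degree azimuth steps, ViT-B/32 backbone, and prompt wording are assumptions made for this sketch, not the paper's exact configuration.

import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical renders of one object taken by a camera circling it in 10-degree steps.
view_angles = list(range(0, 360, 10))
image_paths = [f"renders/chair_az{a:03d}.png" for a in view_angles]  # assumed file layout
images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)

# A viewpoint description to ground, e.g. a canonical "back view" query (assumed wording).
text = clip.tokenize(["a photo of a chair seen from the back"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(images)  # (num_views, d)
    text_features = model.encode_text(text)      # (1, d)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    similarity = (image_features @ text_features.T).squeeze(1)  # cosine similarity per view

best = int(similarity.argmax())
print(f"Best-matching view: azimuth {view_angles[best]} degrees, "
      f"CLIP similarity {similarity[best].item():.3f}")

Under the same setup, the fine-tuning with hard negative sampling mentioned in the abstract would, roughly, contrast each viewpoint description against renders of non-matching viewpoints of the same object; that training loop is not shown here.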
Related papers
- Functionality understanding and segmentation in 3D scenes [6.1744362771344]
We introduce Fun3DU, the first approach designed for functionality understanding in 3D scenes.
Fun3DU uses a language model to parse the task description through Chain-of-Thought reasoning.
We evaluate Fun3DU on SceneFun3D, the most recent and only dataset to benchmark this task.
arXiv Detail & Related papers (2024-11-25T11:57:48Z)
- Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image [70.02187124865627]
Open-vocabulary 3D object detection (OV-3DDet) aims to localize and recognize both seen and previously unseen object categories within any new 3D scene.
We leverage a vision foundation model to provide image-wise guidance for discovering novel classes in 3D scenes.
We demonstrate significant improvements in accuracy and generalization, highlighting the potential of foundation models in advancing open-vocabulary 3D object detection.
arXiv Detail & Related papers (2024-07-07T04:50:04Z)
- Weakly-Supervised 3D Visual Grounding based on Visual Linguistic Alignment [26.858034573776198]
We propose a weakly supervised approach for 3D visual grounding based on Visual Linguistic Alignment.
Our 3D-VLA exploits the superior ability of current large-scale vision-language models to align the semantics of text and 2D images.
During the inference stage, the learned text-3D correspondence will help us ground the text queries to the 3D target objects even without 2D images.
arXiv Detail & Related papers (2023-12-15T09:08:14Z)
- Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers [65.51132104404051]
We introduce the use of object identifiers and object-centric representations to interact with scenes at the object level.
Our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
arXiv Detail & Related papers (2023-12-13T14:27:45Z)
- Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding [57.47315482494805]
Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset.
This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories.
We propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for 3D scenes.
arXiv Detail & Related papers (2023-08-01T07:50:14Z)
- CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes [68.61199623705096]
We design a novel 3D pre-training Vision-Language method that helps a model learn semantically meaningful and transferable 3D scene point cloud representations.
We inject the representational power of the popular CLIP model into our 3D encoder by aligning the encoded 3D scene features with the corresponding 2D image and text embeddings.
We evaluate our model's 3D world reasoning capability on the downstream task of 3D Visual Question Answering.
arXiv Detail & Related papers (2023-04-12T16:52:29Z)
- Language Grounding with 3D Objects [60.67796160959387]
We introduce a novel reasoning task that targets both visual and non-visual language about 3D objects.
We introduce several CLIP-based models for distinguishing objects.
We find that adding view estimation to language grounding models improves accuracy both on SNARE and when identifying objects referred to in language on a robot platform.
arXiv Detail & Related papers (2021-07-26T23:35:58Z)
- LanguageRefer: Spatial-Language Model for 3D Visual Grounding [72.7618059299306]
We develop a spatial-language model for a 3D visual grounding problem.
We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
arXiv Detail & Related papers (2021-07-07T18:55:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.