Self-supervised 3D Semantic Representation Learning for
Vision-and-Language Navigation
- URL: http://arxiv.org/abs/2201.10788v1
- Date: Wed, 26 Jan 2022 07:43:47 GMT
- Title: Self-supervised 3D Semantic Representation Learning for
Vision-and-Language Navigation
- Authors: Sinan Tan, Mengmeng Ge, Di Guo, Huaping Liu and Fuchun Sun
- Abstract summary: We develop a novel training framework to encode the voxel-level 3D semantic reconstruction into a 3D semantic representation.
We construct an LSTM-based navigation model and train it with the proposed 3D semantic representations and BERT language features on vision-language pairs.
Experiments show that the proposed approach achieves success rates of 68% and 66% on the validation unseen and test unseen splits of the R2R dataset.
- Score: 30.429893959096752
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In the Vision-and-Language Navigation task, the embodied agent follows
linguistic instructions and navigates to a specific goal. It is important in
many practical scenarios and has attracted extensive attention from both
computer vision and robotics communities. However, most existing works only use
RGB images but neglect the 3D semantic information of the scene. To address this,
we develop a novel self-supervised training framework to encode the voxel-level
3D semantic reconstruction into a 3D semantic representation. Specifically, a
region query task is designed as the pretext task, which predicts the presence
or absence of objects of a particular class in a specific 3D region. Then, we
construct an LSTM-based navigation model and train it with the proposed 3D
semantic representations and BERT language features on vision-language pairs.
Experiments show that the proposed approach achieves success rates of 68% and
66% on the validation unseen and test unseen splits of the R2R dataset
respectively, which are superior to most RGB-based methods utilizing
vision-language transformers.
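To make the region query pretext task concrete, the sketch below shows one way it could be set up in PyTorch: a small 3D CNN encodes the voxel-level semantic reconstruction, a query is formed from an object class and a 3D region, and a binary head predicts whether that class appears inside the region. The module names, tensor shapes, and hyperparameters are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of a region-query pretext task (illustrative assumptions only;
# not the authors' implementation).
import torch
import torch.nn as nn

class RegionQueryPretext(nn.Module):
    """Encode a voxel-level semantic reconstruction and predict whether an
    object of a queried class is present inside a queried 3D region."""

    def __init__(self, num_classes: int, feat_dim: int = 128):
        super().__init__()
        # 3D CNN encoder over a one-hot semantic voxel grid (assumed design).
        self.encoder = nn.Sequential(
            nn.Conv3d(num_classes, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Query = embedded class id + 3D region box corners (6 coordinates).
        self.class_embed = nn.Embedding(num_classes, feat_dim)
        self.region_proj = nn.Linear(6, feat_dim)
        # Binary classifier: is an object of the queried class in the region?
        self.head = nn.Linear(3 * feat_dim, 1)

    def forward(self, voxels, class_id, region_box):
        # voxels:     (B, num_classes, D, H, W) one-hot semantic grid
        # class_id:   (B,) long tensor of queried object classes
        # region_box: (B, 6) normalized (x1, y1, z1, x2, y2, z2) corners
        scene = self.encoder(voxels)                   # (B, feat_dim)
        query = self.class_embed(class_id)             # (B, feat_dim)
        region = self.region_proj(region_box)          # (B, feat_dim)
        logits = self.head(torch.cat([scene, query, region], dim=-1))
        return logits.squeeze(-1)                      # (B,) presence logits

# Training would use binary cross-entropy against presence labels derived
# automatically from the reconstruction itself, e.g.:
#   model = RegionQueryPretext(num_classes=40)
#   loss = nn.BCEWithLogitsLoss()(model(voxels, class_id, region_box), labels)
```

Because the presence/absence labels can be computed directly from the reconstructed semantic voxels, no manual annotation is needed, which is what makes such a pretext task self-supervised.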
Related papers
- RefMask3D: Language-Guided Transformer for 3D Referring Segmentation [32.11635464720755]
RefMask3D aims to explore comprehensive multi-modal feature interaction and understanding.
RefMask3D outperforms previous state-of-the-art method by a large margin of 3.16% mIoU on the challenging ScanRefer dataset.
arXiv Detail & Related papers (2024-07-25T17:58:03Z)
- 3D Weakly Supervised Semantic Segmentation with 2D Vision-Language Guidance [68.8825501902835]
3DSS-VLG is a weakly supervised approach for 3D Semantic Segmentation with 2D Vision-Language Guidance.
To the best of our knowledge, this is the first work to investigate 3D weakly supervised semantic segmentation by using the textual semantic information of text category labels.
arXiv Detail & Related papers (2024-07-13T09:39:11Z)
- Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image [70.02187124865627]
Open-vocabulary 3D object detection (OV-3DDet) aims to localize and recognize both seen and previously unseen object categories within any new 3D scene.
We leverage a vision foundation model to provide image-wise guidance for discovering novel classes in 3D scenes.
We demonstrate significant improvements in accuracy and generalization, highlighting the potential of foundation models in advancing open-vocabulary 3D object detection.
arXiv Detail & Related papers (2024-07-07T04:50:04Z)
- SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations on robotic tasks.
arXiv Detail & Related papers (2024-04-01T21:23:03Z)
- Volumetric Environment Representation for Vision-Language Navigation [66.04379819772764]
Vision-language navigation (VLN) requires an agent to navigate through a 3D environment based on visual observations and natural language instructions.
We introduce a Volumetric Environment Representation (VER), which voxelizes the physical world into structured 3D cells.
VER predicts 3D occupancy, 3D room layout, and 3D bounding boxes jointly.
arXiv Detail & Related papers (2024-03-21T06:14:46Z)
- Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding [2.517953665531978]
We introduce Language Embedded 3D Gaussians, a novel scene representation for open-vocabulary query tasks.
Our representation achieves the best visual quality and language querying accuracy among current language-embedded representations.
arXiv Detail & Related papers (2023-11-30T11:50:07Z)
- Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding [57.47315482494805]
Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset.
This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories.
We propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for 3D scenes.
arXiv Detail & Related papers (2023-08-01T07:50:14Z)
- Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding [47.48443919164377]
A vision-language pre-training framework is proposed to transfer flexibly to 3D vision-language downstream tasks.
In this paper, we investigate three common tasks in semantic 3D scene understanding, and derive key insights into a pre-training model.
Experiments verify the excellent performance of the framework on three 3D vision-language tasks.
arXiv Detail & Related papers (2023-05-18T05:25:40Z)
- LanguageRefer: Spatial-Language Model for 3D Visual Grounding [72.7618059299306]
We develop a spatial-language model for a 3D visual grounding problem.
We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
arXiv Detail & Related papers (2021-07-07T18:55:03Z)