Related papers: Bird's-Eye-View Scene Graph for Vision-Language Navigation

Bird's-Eye-View Scene Graph for Vision-Language Navigation

URL: http://arxiv.org/abs/2308.04758v2
Date: Sat, 12 Aug 2023 08:29:16 GMT
Title: Bird's-Eye-View Scene Graph for Vision-Language Navigation
Authors: Rui Liu, Xiaohan Wang, Wenguan Wang, Yi Yang
Abstract summary: Vision-language navigation (VLN) entails an agent to navigate 3D environments following human instructions. We present a BEV Scene Graph (BSG), which leverages multi-step BEV representations to encode scene layouts and geometric cues of indoor environment. Based on BSG, the agent predicts a local BEV grid-level decision score and a global graph-level decision score, combined with a sub-view selection score on panoramic views.
Score: 85.72725920024578
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-language navigation (VLN), which entails an agent to navigate 3D environments following human instructions, has shown great advances. However, current agents are built upon panoramic observations, which hinders their ability to perceive 3D scene geometry and easily leads to ambiguous selection of panoramic view. To address these limitations, we present a BEV Scene Graph (BSG), which leverages multi-step BEV representations to encode scene layouts and geometric cues of indoor environment under the supervision of 3D detection. During navigation, BSG builds a local BEV representation at each step and maintains a BEV-based global scene map, which stores and organizes all the online collected local BEV representations according to their topological relations. Based on BSG, the agent predicts a local BEV grid-level decision score and a global graph-level decision score, combined with a sub-view selection score on panoramic views, for more accurate action prediction. Our approach significantly outperforms state-of-the-art methods on REVERIE, R2R, and R4R, showing the potential of BEV perception in VLN.

Related papers

BEV-LLM: Leveraging Multimodal BEV Maps for Scene Captioning in Autonomous Driving [3.061835990893183]
We introduce BEV-LLM, a lightweight model for 3D captioning of autonomous driving scenes.<n>Despite using a small 1B parameter base model, BEV-LLM achieves competitive performance on the nuCaption dataset.<n>We release two new datasets to better assess scene captioning across diverse driving scenarios.
arXiv Detail & Related papers (2025-07-25T15:22:56Z)
VQ-Map: Bird's-Eye-View Map Layout Estimation in Tokenized Discrete Space via Vector Quantization [108.68014173017583]
Bird's-eye-view (BEV) map layout estimation requires an accurate and full understanding of the semantics for the environmental elements around the ego car. We propose to utilize a generative model similar to the Vector Quantized-Variational AutoEncoder (VQ-VAE) to acquire prior knowledge for the high-level BEV semantics in the tokenized discrete space. Thanks to the obtained BEV tokens accompanied with a codebook embedding encapsulating the semantics for different BEV elements in the groundtruth maps, we are able to directly align the sparse backbone image features with the obtained BEV tokens
arXiv Detail & Related papers (2024-11-03T16:09:47Z)
Navigation Instruction Generation with BEV Perception and Large Language Models [60.455964599187205]
We propose BEVInstructor, which incorporates Bird's Eye View (BEV) features into Multi-Modal Large Language Models (MLLMs) for instruction generation. Specifically, BEVInstructor constructs a PerspectiveBEV for the comprehension of 3D environments through fusing BEV and perspective features. Based on the perspective-BEV prompts, BEVInstructor further adopts an instance-guided iterative refinement pipeline, which improves the instructions in a progressive manner.
arXiv Detail & Related papers (2024-07-21T08:05:29Z)
Volumetric Environment Representation for Vision-Language Navigation [66.04379819772764]
Vision-language navigation (VLN) requires an agent to navigate through a 3D environment based on visual observations and natural language instructions. We introduce a Volumetric Environment Representation (VER), which voxelizes the physical world into structured 3D cells. VER predicts 3D occupancy, 3D room layout, and 3D bounding boxes jointly.
arXiv Detail & Related papers (2024-03-21T06:14:46Z)
DA-BEV: Unsupervised Domain Adaptation for Bird's Eye View Perception [104.87876441265593]
Camera-only Bird's Eye View (BEV) has demonstrated great potential in environment perception in a 3D space. Unsupervised domain adaptive BEV, which effective learning from various unlabelled target data, is far under-explored. We design DA-BEV, the first domain adaptive camera-only BEV framework that addresses domain adaptive BEV challenges by exploiting the complementary nature of image-view features and BEV features.
arXiv Detail & Related papers (2024-01-13T04:21:24Z)
BEV-SAN: Accurate BEV 3D Object Detection via Slice Attention Networks [28.024042528077125]
Bird's-Eye-View (BEV) 3D Object Detection is a crucial multi-view technique for autonomous driving systems. We propose a novel method named BEV Slice Attention Network (BEV-SAN) for exploiting the intrinsic characteristics of different heights.
arXiv Detail & Related papers (2022-12-02T15:14:48Z)
GitNet: Geometric Prior-based Transformation for Birds-Eye-View Segmentation [105.19949897812494]
Birds-eye-view (BEV) semantic segmentation is critical for autonomous driving. We present a novel two-stage Geometry Prior-based Transformation framework named GitNet.
arXiv Detail & Related papers (2022-04-16T06:46:45Z)
Bird's-Eye-View Panoptic Segmentation Using Monocular Frontal View Images [4.449481309681663]
We present the first end-to-end learning approach for directly predicting dense panoptic segmentation maps in the Bird's-Eye-View (BEV) maps. Our architecture follows the top-down paradigm and incorporates a novel dense transformer module. We derive a mathematical formulation for the sensitivity of the FV-BEV transformation which allows us to intelligently weight pixels in the BEV space.
arXiv Detail & Related papers (2021-08-06T17:59:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.