Know What and Know Where: An Object-and-Room Informed Sequential BERT
for Indoor Vision-Language Navigation
- URL: http://arxiv.org/abs/2104.04167v1
- Date: Fri, 9 Apr 2021 02:44:39 GMT
- Title: Know What and Know Where: An Object-and-Room Informed Sequential BERT
for Indoor Vision-Language Navigation
- Authors: Yuankai Qi, Zizheng Pan, Yicong Hong, Ming-Hsuan Yang, Anton van den
Hengel, Qi Wu
- Abstract summary: Vision-and-Language Navigation (VLN) requires an agent to navigate to a remote location on the basis of natural-language instructions and a set of photo-realistic panoramas.
Most existing methods take words in instructions and discrete views of each panorama as the minimal unit of encoding.
We propose an object-informed sequential BERT to encode visual perceptions and linguistic instructions at the same fine-grained level.
- Score: 120.90387630691816
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-and-Language Navigation (VLN) requires an agent to navigate to a
remote location on the basis of natural-language instructions and a set of
photo-realistic panoramas. Most existing methods take words in instructions and
discrete views of each panorama as the minimal unit of encoding. However, this
requires a model to match different textual landmarks in instructions (e.g.,
TV, table) against the same view feature. In this work, we propose an
object-informed sequential BERT to encode visual perceptions and linguistic
instructions at the same fine-grained level, namely objects and words, to
facilitate the matching between visual and textual entities and hence "know
what". Our sequential BERT enables the visual-textual clues to be interpreted
in light of the temporal context, which is crucial to multi-round VLN tasks.
Additionally, we enable the model to identify the relative direction (e.g.,
left/right/front/back) of each navigable location and the room type (e.g.,
bedroom, kitchen) of its current and final navigation goal, namely "know
where", as such information is widely mentioned in instructions implying the
desired next and final locations. Extensive experiments demonstrate the
effectiveness compared against several state-of-the-art methods on three indoor
VLN tasks: REVERIE, NDH, and R2R.
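To make the "know what / know where" idea above concrete, here is a minimal, hypothetical PyTorch sketch of an encoder that takes instruction words and detected object features as one token sequence and adds auxiliary heads for candidate direction and room type. Every module name, dimension, and label set below is an illustrative assumption rather than the authors' implementation, and the recurrent state that makes the model "sequential" across navigation steps is omitted for brevity.

```python
# Hypothetical sketch: words and detected objects share one token sequence;
# auxiliary heads predict relative direction and room type ("know where").
# Not the authors' implementation; all sizes and label sets are assumptions.
import torch
import torch.nn as nn

class ObjectWordEncoder(nn.Module):
    def __init__(self, vocab_size=30522, obj_feat_dim=2048, d_model=768,
                 num_directions=4, num_room_types=30):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, d_model)
        self.obj_proj = nn.Linear(obj_feat_dim, d_model)   # map detector features into token space
        layer = nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        # "Know where": auxiliary classifiers over a pooled state
        self.direction_head = nn.Linear(d_model, num_directions)  # left/right/front/back
        self.room_head = nn.Linear(d_model, num_room_types)       # e.g. bedroom, kitchen, ...

    def forward(self, word_ids, obj_feats):
        # word_ids: (B, Lw) instruction tokens; obj_feats: (B, Lo, obj_feat_dim) object features
        tokens = torch.cat([self.word_embed(word_ids), self.obj_proj(obj_feats)], dim=1)
        hidden = self.encoder(tokens)
        pooled = hidden[:, 0]  # first token used as a summary state
        return hidden, self.direction_head(pooled), self.room_head(pooled)

# Toy usage with random inputs
model = ObjectWordEncoder()
words = torch.randint(0, 30522, (2, 20))
objects = torch.randn(2, 10, 2048)
_, direction_logits, room_logits = model(words, objects)
print(direction_logits.shape, room_logits.shape)  # torch.Size([2, 4]) torch.Size([2, 30])
```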
Related papers
- Why Only Text: Empowering Vision-and-Language Navigation with Multi-modal Prompts [37.20272055902246]
Vision-and-Language Navigation with Multi-modal Prompts (VLN-MP) is a novel task augmenting traditional VLN by integrating both natural language and images in instructions.
VLN-MP not only maintains backward compatibility by effectively handling text-only prompts but also consistently shows advantages with different quantities and relevance of visual prompts.
arXiv Detail & Related papers (2024-06-04T11:06:13Z)
- LangNav: Language as a Perceptual Representation for Navigation [63.90602960822604]
We explore the use of language as a perceptual representation for vision-and-language navigation (VLN).
Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent's egocentric panoramic view at each time step into natural language descriptions (a toy sketch of this view-to-text idea appears after this list).
arXiv Detail & Related papers (2023-10-11T20:52:30Z)
- VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View [81.58612867186633]
Vision and Language Navigation (VLN) requires visual and natural language understanding as well as spatial and temporal reasoning capabilities.
We show that VELMA is able to successfully follow navigation instructions in Street View with only two in-context examples.
We further finetune the LLM agent on a few thousand examples and achieve 25%-30% relative improvement in task completion over the previous state-of-the-art for two datasets.
arXiv Detail & Related papers (2023-07-12T11:08:24Z)
- GeoVLN: Learning Geometry-Enhanced Visual Representation with Slot Attention for Vision-and-Language Navigation [52.65506307440127]
We propose GeoVLN, which learns a geometry-enhanced visual representation based on slot attention for robust Vision-and-Language Navigation.
We employ V&L BERT to learn a cross-modal representation that incorporates both language and vision information.
arXiv Detail & Related papers (2023-05-26T17:15:22Z)
- SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation [57.12508968239015]
This work presents a transformer-based vision-and-language navigation (VLN) agent.
It uses two different visual encoders -- a scene classification network and an object detector.
Scene features contribute high-level contextual information that supports object-level processing.
arXiv Detail & Related papers (2021-10-27T03:29:34Z)
- Improving Cross-Modal Alignment in Vision Language Navigation via Syntactic Information [83.62098382773266]
Vision language navigation is the task that requires an agent to navigate through a 3D environment based on natural language instructions.
We propose a navigation agent that utilizes syntax information derived from a dependency tree to enhance alignment between the instruction and the current visual scenes.
Our agent achieves the new state-of-the-art on the Room-Across-Room dataset, which contains instructions in three languages.
arXiv Detail & Related papers (2021-04-19T19:18:41Z)
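As referenced in the LangNav entry above, the view-to-text idea (converting each egocentric view into a natural language description with off-the-shelf captioning and detection) can be illustrated with a short sketch. The caption_view and detect_objects callables below are hypothetical placeholders for whatever vision models one plugs in; nothing here reflects the LangNav codebase.

```python
# Toy illustration of verbalizing a panoramic observation into text that a
# language model can reason over. caption_view/detect_objects are hypothetical
# stand-ins for off-the-shelf captioning and object-detection models.
from typing import Callable, List

def verbalize_panorama(views: List[str],
                       caption_view: Callable[[str], str],
                       detect_objects: Callable[[str], List[str]]) -> str:
    """Turn one panoramic observation (a list of view images) into text."""
    lines = []
    step = 360 // max(len(views), 1)
    for heading_deg, view in zip(range(0, 360, step), views):
        caption = caption_view(view)
        objects = ", ".join(detect_objects(view)) or "nothing notable"
        lines.append(f"Facing {heading_deg} degrees: {caption} Visible objects: {objects}.")
    return "\n".join(lines)

# Toy usage with stub models
stub_caption = lambda img: "a hallway leading to a bright kitchen."
stub_detect = lambda img: ["refrigerator", "table"]
print(verbalize_panorama(["view_0.jpg", "view_1.jpg"], stub_caption, stub_detect))
```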
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.