VELMA: Verbalization Embodiment of LLM Agents for Vision and Language
Navigation in Street View
- URL: http://arxiv.org/abs/2307.06082v2
- Date: Wed, 24 Jan 2024 15:10:07 GMT
- Title: VELMA: Verbalization Embodiment of LLM Agents for Vision and Language
Navigation in Street View
- Authors: Raphael Schumann and Wanrong Zhu and Weixi Feng and Tsu-Jui Fu and
Stefan Riezler and William Yang Wang
- Abstract summary: Vision and Language Navigation (VLN) requires visual and natural language understanding as well as spatial and temporal reasoning capabilities.
We show that VELMA is able to successfully follow navigation instructions in Street View with only two in-context examples.
We further finetune the LLM agent on a few thousand examples and achieve 25%-30% relative improvement in task completion over the previous state-of-the-art for two datasets.
- Score: 81.58612867186633
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Incremental decision making in real-world environments is one of the most
challenging tasks in embodied artificial intelligence. One particularly
demanding scenario is Vision and Language Navigation (VLN) which requires
visual and natural language understanding as well as spatial and temporal
reasoning capabilities. The embodied agent needs to ground its understanding of
navigation instructions in observations of a real-world environment like Street
View. Despite the impressive results of LLMs in other research areas, it is an
ongoing problem of how to best connect them with an interactive visual
environment. In this work, we propose VELMA, an embodied LLM agent that uses a
verbalization of the trajectory and of visual environment observations as
contextual prompt for the next action. Visual information is verbalized by a
pipeline that extracts landmarks from the human written navigation instructions
and uses CLIP to determine their visibility in the current panorama view. We
show that VELMA is able to successfully follow navigation instructions in
Street View with only two in-context examples. We further finetune the LLM
agent on a few thousand examples and achieve 25%-30% relative improvement in
task completion over the previous state-of-the-art for two datasets.
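
The verbalization pipeline described in the abstract can be pictured with a short sketch: landmarks extracted from the written instructions are scored against the current panorama view with CLIP, and those above a similarity threshold are turned into a sentence that is appended to the agent's contextual prompt. This is a minimal illustration, not the authors' implementation; the Hugging Face `openai/clip-vit-base-patch32` checkpoint, the "a photo of a ..." prompt template, and the logit threshold are all assumptions.

```python
# Minimal sketch of the landmark-visibility verbalization described in the
# abstract. Assumptions (not from the paper): the CLIP checkpoint, the
# "a photo of a ..." prompt template, and the logit threshold.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"  # assumed checkpoint
model = CLIPModel.from_pretrained(MODEL_NAME).eval()
processor = CLIPProcessor.from_pretrained(MODEL_NAME)


def visible_landmarks(view: Image.Image, landmarks: list[str],
                      threshold: float = 25.0) -> list[str]:
    """Return the landmarks CLIP scores as visible in the current view.

    The threshold on the raw image-text logits is a hypothetical value.
    """
    prompts = [f"a photo of a {landmark}" for landmark in landmarks]
    inputs = processor(text=prompts, images=view, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image[0]  # one score per landmark
    return [lm for lm, score in zip(landmarks, logits.tolist()) if score > threshold]


def verbalize_observation(view: Image.Image, landmarks: list[str]) -> str:
    """Turn the visibility check into text for the agent's contextual prompt."""
    seen = visible_landmarks(view, landmarks)
    if not seen:
        return "There are no instruction landmarks in sight."
    return "You see " + " and ".join(seen) + "."
```

In VELMA's setup, an observation string like this would be appended to the verbalized trajectory that prompts the LLM for its next action; the exact wording and decision rule used by the authors may differ from this sketch.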
Related papers
- Aerial Vision-and-Language Navigation via Semantic-Topo-Metric Representation Guided LLM Reasoning [48.33405770713208]
We propose an end-to-end framework for aerial VLN tasks, where the large language model (LLM) is introduced as our agent for action prediction.
We develop a novel Semantic-Topo-Metric Representation (STMR) to enhance the spatial reasoning ability of LLMs.
Experiments conducted in real and simulation environments have successfully proved the effectiveness and robustness of our method.
arXiv Detail & Related papers (2024-10-11T03:54:48Z) - Bridging Vision and Language Spaces with Assignment Prediction [47.04855334955006]
VLAP is a novel approach that bridges pretrained vision models and large language models (LLMs).
We harness well-established word embeddings to bridge two modality embedding spaces.
VLAP achieves substantial improvements over the previous linear transformation-based approaches.
arXiv Detail & Related papers (2024-04-15T10:04:15Z) - LangNav: Language as a Perceptual Representation for Navigation [63.90602960822604]
We explore the use of language as a perceptual representation for vision-and-language navigation (VLN).
Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent's egocentric panoramic view at each time step into natural language descriptions.
arXiv Detail & Related papers (2023-10-11T20:52:30Z) - CLEAR: Improving Vision-Language Navigation with Cross-Lingual,
Environment-Agnostic Representations [98.30038910061894]
Vision-and-Language Navigation (VLN) tasks require an agent to navigate through the environment based on language instructions.
We propose CLEAR: Cross-Lingual and Environment-Agnostic Representations.
Our language and visual representations can be successfully transferred to the Room-to-Room and Cooperative Vision-and-Dialogue Navigation task.
arXiv Detail & Related papers (2022-07-05T17:38:59Z) - Know What and Know Where: An Object-and-Room Informed Sequential BERT
for Indoor Vision-Language Navigation [120.90387630691816]
Vision-and-Language Navigation (VLN) requires an agent to navigate to a remote location on the basis of natural-language instructions and a set of photo-realistic panoramas.
Most existing methods take words in instructions and discrete views of each panorama as the minimal unit of encoding.
We propose an object-informed sequential BERT to encode visual perceptions and linguistic instructions at the same fine-grained level.
arXiv Detail & Related papers (2021-04-09T02:44:39Z) - Diagnosing Vision-and-Language Navigation: What Really Matters [61.72935815656582]
Vision-and-language navigation (VLN) is a multimodal task where an agent follows natural language instructions and navigates in visual environments.
Recent studies have observed a slow-down in performance improvements on both indoor and outdoor VLN tasks.
In this work, we conduct a series of diagnostic experiments to unveil agents' focus during navigation.
arXiv Detail & Related papers (2021-03-30T17:59:07Z)