Related papers: Vision-and-Language Navigation with Analogical Textual Descriptions in LLMs

Vision-and-Language Navigation with Analogical Textual Descriptions in LLMs

URL: http://arxiv.org/abs/2509.25139v1
Date: Mon, 29 Sep 2025 17:51:01 GMT
Title: Vision-and-Language Navigation with Analogical Textual Descriptions in LLMs
Authors: Yue Zhang, Tianyi Ma, Zun Wang, Yanyuan Qiao, Parisa Kordjamshidi,
Abstract summary: Existing Vision-and-Language Navigation (VLN) agents encode images as textual scene descriptions.<n>We improve the navigation agent's contextual understanding by incorporating textual descriptions from multiple perspectives.<n>We evaluate our approach on the R2R dataset, where our experiments demonstrate significant improvements in navigation performance.
Score: 41.977702477816756
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Integrating large language models (LLMs) into embodied AI models is becoming increasingly prevalent. However, existing zero-shot LLM-based Vision-and-Language Navigation (VLN) agents either encode images as textual scene descriptions, potentially oversimplifying visual details, or process raw image inputs, which can fail to capture abstract semantics required for high-level reasoning. In this paper, we improve the navigation agent's contextual understanding by incorporating textual descriptions from multiple perspectives that facilitate analogical reasoning across images. By leveraging text-based analogical reasoning, the agent enhances its global scene understanding and spatial reasoning, leading to more accurate action decisions. We evaluate our approach on the R2R dataset, where our experiments demonstrate significant improvements in navigation performance.

Related papers

ViSS-R1: Self-Supervised Reinforcement Video Reasoning [84.1180294023835]
We introduce a novel self-supervised reinforcement learning GRPO algorithm (Pretext-GRPO) within the standard R1 pipeline.<n>We also propose the ViSS-R1 framework, which streamlines and integrates pretext-task-based self-supervised learning directly into the MLLM's R1 post-training paradigm.
arXiv Detail & Related papers (2025-11-17T07:00:42Z)
Integrating Visual Interpretation and Linguistic Reasoning for Math Problem Solving [61.992824291296444]
Current large vision-language models (LVLMs) typically employ a connector module to link visual features with text embeddings of large language models (LLMs)<n>This paper proposes a paradigm shift: instead of training end-to-end vision-language reasoning models, we advocate for developing a decoupled reasoning framework.
arXiv Detail & Related papers (2025-05-23T08:18:00Z)
VIPER: Visual Perception and Explainable Reasoning for Sequential Decision-Making [17.313485392764353]
VIPER is a novel framework for multimodal instruction-based planning.<n>It integrates VLM-based perception with LLM-based reasoning.<n>We show that VIPER significantly outperforms state-of-the-art visual instruction-based planners.
arXiv Detail & Related papers (2025-03-19T11:05:42Z)
Towards Interpreting Visual Information Processing in Vision-Language Models [24.51408101801313]
Vision-Language Models (VLMs) are powerful tools for processing and understanding text and images.<n>We study the processing of visual tokens in the language model component of LLaVA, a prominent VLM.
arXiv Detail & Related papers (2024-10-09T17:55:02Z)
From Redundancy to Relevance: Information Flow in LVLMs Across Reasoning Tasks [33.476693301050275]
We conduct experiments with truncation strategies across various LVLMs for visual question answering and image captioning tasks. By exploring the information flow from the perspective of visual representation contribution, we observe that it tends to converge in shallow layers but diversify in deeper layers.
arXiv Detail & Related papers (2024-06-04T13:52:54Z)
Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest. This technique allows LVLMs to access more detailed visual information without altering the original image resolution. Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
Image Translation as Diffusion Visual Programmers [52.09889190442439]
Diffusion Visual Programmer (DVP) is a neuro-symbolic image translation framework. Our framework seamlessly embeds a condition-flexible diffusion model within the GPT architecture. Extensive experiments demonstrate DVP's remarkable performance, surpassing concurrent arts.
arXiv Detail & Related papers (2024-01-18T05:50:09Z)
VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View [81.58612867186633]
Vision and Language Navigation(VLN) requires visual and natural language understanding as well as spatial and temporal reasoning capabilities. We show that VELMA is able to successfully follow navigation instructions in Street View with only two in-context examples. We further finetune the LLM agent on a few thousand examples and achieve 25%-30% relative improvement in task completion over the previous state-of-the-art for two datasets.
arXiv Detail & Related papers (2023-07-12T11:08:24Z)
Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models [24.456117679941816]
Contrastive Reading Model (Cream) is a novel neural architecture designed to enhance the language-image understanding capability of Large Language Models (LLMs) Our approach bridges the gap between vision and language understanding, paving the way for the development of more sophisticated Document Intelligence Assistants.
arXiv Detail & Related papers (2023-05-24T11:59:13Z)
Exploring Explicit and Implicit Visual Relationships for Image Captioning [11.82805641934772]
In this paper, we explore explicit and implicit visual relationships to enrich region-level representations for image captioning. Explicitly, we build semantic graph over object pairs and exploit gated graph convolutional networks (Gated GCN) to selectively aggregate local neighbors' information. Implicitly, we draw global interactions among the detected objects through region-based bidirectional encoder representations from transformers.
arXiv Detail & Related papers (2021-05-06T01:47:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.