VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning
- URL: http://arxiv.org/abs/2506.17221v2
- Date: Wed, 25 Jun 2025 06:03:22 GMT
- Title: VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning
- Authors: Zhangyang Qi, Zhixiong Zhang, Yizhou Yu, Jiaqi Wang, Hengshuang Zhao
- Abstract summary: Vision-Language Navigation (VLN) is a core challenge in embodied AI, requiring agents to navigate real-world environments using natural language instructions. We propose VLN-R1, an end-to-end framework that leverages Large Vision-Language Models (LVLMs) to directly translate egocentric video streams into continuous navigation actions.
- Score: 77.34267241692706
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Navigation (VLN) is a core challenge in embodied AI, requiring agents to navigate real-world environments using natural language instructions. Current language model-based navigation systems operate on discrete topological graphs, limiting path planning to predefined node connections. We propose VLN-R1, an end-to-end framework that leverages Large Vision-Language Models (LVLMs) to directly translate egocentric video streams into continuous navigation actions, adopting GRPO-based training inspired by DeepSeek-R1. To enable effective training, we first construct the VLN-Ego dataset using the 3D simulator Habitat, and propose Long-Short Memory Sampling to balance historical and current observations. While large language models can supervise complete textual instructions, they lack fine-grained action-level control. Our framework employs a two-stage training approach: a) Supervised fine-tuning (SFT) to align the model's action sequence text predictions with expert demonstrations, followed by b) Reinforcement fine-tuning (RFT) enhanced with a Time-Decayed Reward (TDR) mechanism that strategically weights multi-step future actions. Experimental results show VLN-R1 achieves strong performance on the VLN-CE benchmark. VLN-R1 proves LVLMs can drive embodied navigation and enhance task-specific reasoning through data-efficient, reward-driven post-training.
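To make the two training ingredients named in the abstract concrete, the sketch below shows one plausible reading of Long-Short Memory Sampling and the Time-Decayed Reward (TDR). The frame-budget split, the uniform sub-sampling of older history, the exponential decay form, and all function names are illustrative assumptions, not details published in the abstract.

```python
# Minimal sketch (not the authors' code): one plausible form of
# Long-Short Memory Sampling and the Time-Decayed Reward (TDR).
from typing import List, Sequence


def long_short_memory_sample(frames: Sequence, n_short: int = 8, n_long: int = 8) -> List:
    """Keep the most recent n_short frames densely and sample n_long frames
    sparsely (uniformly) from the older history, balancing current and
    historical observations as the abstract describes."""
    history, recent = list(frames[:-n_short]), list(frames[-n_short:])
    if not history:
        return recent
    stride = max(1, len(history) // n_long)
    long_term = history[::stride][:n_long]
    return long_term + recent


def time_decayed_reward(pred: Sequence[str], expert: Sequence[str], decay: float = 0.8) -> float:
    """Score a predicted multi-step action sequence against the expert one,
    weighting earlier future steps more than later ones (assumed decay form)."""
    horizon = min(len(pred), len(expert))
    return sum(decay ** t * float(pred[t] == expert[t]) for t in range(horizon))


if __name__ == "__main__":
    frames = [f"frame_{i}" for i in range(40)]
    print(long_short_memory_sample(frames))
    print(time_decayed_reward(["FORWARD", "LEFT", "FORWARD"], ["FORWARD", "LEFT", "RIGHT"]))
```

Under these assumptions, the sampler bounds the visual context fed to the LVLM while retaining long-horizon cues, and the decayed reward gives the RFT stage denser credit for near-term actions than for distant ones.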
Related papers
- Run, Ruminate, and Regulate: A Dual-process Thinking System for Vision-and-Language Navigation [52.11339614452127]
Vision-and-Language Navigation (VLN) requires an agent to dynamically explore complex 3D environments following human instructions. Recent research underscores the potential of harnessing large language models (LLMs) for VLN, given their commonsense knowledge and general reasoning capabilities. We propose a novel dual-process thinking framework dubbed R3, integrating LLMs' generalization capabilities with VLN-specific expertise in a zero-shot manner.
arXiv Detail & Related papers (2025-11-18T04:32:00Z) - VLN-Zero: Rapid Exploration and Cache-Enabled Neurosymbolic Vision-Language Planning for Zero-Shot Transfer in Robot Navigation [52.00474922315126]
We present VLN-Zero, a vision-language navigation framework for unseen environments. We use vision-language models to efficiently construct symbolic scene graphs and enable zero-shot neurosymbolic navigation. VLN-Zero achieves a 2x higher success rate than state-of-the-art zero-shot models, outperforms most fine-tuned baselines, and reaches goal locations in half the time.
arXiv Detail & Related papers (2025-09-23T03:23:03Z) - ActiveVLN: Towards Active Exploration via Multi-Turn RL in Vision-and-Language Navigation [57.399685080574756]
Existing MLLM-based VLN methods rely on imitation learning (IL) and often use DAgger for post-training. We propose ActiveVLN, a VLN framework that explicitly enables active exploration through multi-turn RL. Experiments show that ActiveVLN achieves the largest performance gains over IL baselines compared to both DAgger-based and prior RL-based post-training methods.
arXiv Detail & Related papers (2025-09-16T03:31:46Z) - EvolveNav: Self-Improving Embodied Reasoning for LLM-Based Vision-Language Navigation [111.0993686148283]
We propose a novel self-improving embodied reasoning framework for boosting Vision-Language Navigation, dubbed EvolveNav. Our EvolveNav consists of two stages: (1) Formalized CoT Supervised Fine-Tuning, where we train the model with formalized CoT labels to activate the model's navigational reasoning capabilities and increase the reasoning speed; (2) Self-Reflective Post-Training, where the model is iteratively trained with its own reasoning outputs as self-enriched CoT labels to enhance the supervision diversity.
arXiv Detail & Related papers (2025-06-02T11:28:32Z) - Vision-and-Language Navigation Generative Pretrained Transformer [0.0]
Vision-and-Language Navigation Generative Pretrained Transformer (VLN-GPT) adopts a transformer decoder model (GPT-2) to model trajectory sequence dependencies, bypassing the need for historical encoding modules.
Performance assessments on the VLN dataset reveal that VLN-GPT surpasses complex state-of-the-art encoder-based models.
arXiv Detail & Related papers (2024-05-27T09:42:04Z) - Continual Vision-and-Language Navigation [18.20829279972436]
Vision-and-Language Navigation (VLN) agents navigate to a destination using natural language instructions and visual cues. We propose the Continual Vision-and-Language Navigation (CVLN) paradigm for agents to continually learn and adapt to changing environments.
arXiv Detail & Related papers (2024-03-22T09:15:36Z) - NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning [97.88246428240872]
Vision-and-Language Navigation (VLN), as a crucial research problem of Embodied AI, requires an embodied agent to navigate through complex 3D environments following natural language instructions. Recent research has highlighted the promising capacity of large language models (LLMs) in VLN by improving navigational reasoning accuracy and interpretability. This paper introduces a novel strategy called Navigational Chain-of-Thought (NavCoT), where we fulfill parameter-efficient in-domain training to enable self-guided navigational decision-making.
arXiv Detail & Related papers (2024-03-12T07:27:02Z) - BEVBert: Multimodal Map Pre-training for Language-guided Navigation [75.23388288113817]
We propose a new map-based pre-training paradigm that is spatial-aware for use in vision-and-language navigation (VLN).
We build a local metric map to explicitly aggregate incomplete observations and remove duplicates, while modeling navigation dependency in a global topological map.
Based on the hybrid map, we devise a pre-training framework to learn a multimodal map representation, which enhances spatial-aware cross-modal reasoning, thereby facilitating the language-guided navigation goal.
arXiv Detail & Related papers (2022-12-08T16:27:54Z) - ULN: Towards Underspecified Vision-and-Language Navigation [77.81257404252132]
Underspecified Vision-and-Language Navigation (ULN) is a new setting for Vision-and-Language Navigation (VLN).
We propose a VLN framework that consists of a classification module, a navigation agent, and an Exploitation-to-Exploration (E2E) module.
Our framework is more robust and outperforms the baselines on ULN by 10% relative success rate across all levels.
arXiv Detail & Related papers (2022-10-18T17:45:06Z) - Reinforced Structured State-Evolution for Vision-Language Navigation [42.46176089721314]
The Vision-and-Language Navigation (VLN) task requires an embodied agent to navigate to a remote location following a natural language instruction.
Previous methods usually adopt a sequence model (e.g., Transformer and LSTM) as the navigator.
We propose a novel Structured state-Evolution (SEvol) model to effectively maintain the environment layout clues for VLN.
arXiv Detail & Related papers (2022-04-20T07:51:20Z) - Towards Learning a Generic Agent for Vision-and-Language Navigation via
Pre-training [150.35927365127176]
We present the first pre-training and fine-tuning paradigm for vision-and-language navigation (VLN) tasks.
By training on a large amount of image-text-action triplets in a self-supervised learning manner, the pre-trained model provides generic representations of visual environments and language instructions.
It learns more effectively on new tasks and generalizes better to previously unseen environments.
arXiv Detail & Related papers (2020-02-25T03:08:12Z)