FlexVLN: Flexible Adaptation for Diverse Vision-and-Language Navigation Tasks
- URL: http://arxiv.org/abs/2503.13966v1
- Date: Tue, 18 Mar 2025 06:58:41 GMT
- Title: FlexVLN: Flexible Adaptation for Diverse Vision-and-Language Navigation Tasks
- Authors: Siqi Zhang, Yanyuan Qiao, Qunbo Wang, Longteng Guo, Zhihua Wei, Jing Liu
- Abstract summary: We propose FlexVLN, an innovative hierarchical approach to Vision-and-Language Navigation (VLN). It integrates the navigation ability of a supervised-learning-based Instruction Follower with the robust generalization ability of the LLM Planner. We take REVERIE, SOON, and CVDN-target as out-of-domain datasets for assessing generalization ability.
- Score: 13.969116430006215
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The aspiration of the Vision-and-Language Navigation (VLN) task has long been to develop an embodied agent with robust adaptability, capable of seamlessly transferring its navigation capabilities across various tasks. Despite remarkable advancements in recent years, most methods necessitate dataset-specific training, thereby lacking the capability to generalize across diverse datasets encompassing distinct types of instructions. Large language models (LLMs) have demonstrated exceptional reasoning and generalization abilities, exhibiting immense potential in robot action planning. In this paper, we propose FlexVLN, an innovative hierarchical approach to VLN that integrates the fundamental navigation ability of a supervised-learning-based Instruction Follower with the robust generalization ability of the LLM Planner, enabling effective generalization across diverse VLN datasets. Moreover, a verification mechanism and a multi-model integration mechanism are proposed to mitigate potential hallucinations by the LLM Planner and enhance the execution accuracy of the Instruction Follower. We take REVERIE, SOON, and CVDN-target as out-of-domain datasets for assessing generalization ability. The generalization performance of FlexVLN surpasses that of all previous methods by a large margin.
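A minimal sketch of the hierarchical idea described in the abstract, assuming hypothetical `llm_plan`, `verify_subgoal`, `follower_models`, and `env` interfaces; it illustrates the planner/follower split with sub-goal verification and majority-vote multi-model integration, not the authors' actual implementation:

```python
# Hypothetical planner/follower loop; all interfaces are placeholders,
# not the FlexVLN release.

def navigate(instruction, env, llm_plan, follower_models, verify_subgoal,
             max_steps=30):
    """Run one episode: the LLM Planner decomposes the high-level instruction
    into sub-goals; an ensemble of Instruction Followers executes each one."""
    path = []
    observation = env.reset()
    for _ in range(max_steps):
        # 1. LLM Planner: propose the next sub-goal from instruction + context.
        subgoal = llm_plan(instruction, observation, history=path)

        # 2. Verification: reject hallucinated sub-goals (e.g. objects or
        #    directions not supported by the current observation) and re-plan.
        if not verify_subgoal(subgoal, observation):
            continue

        # 3. Multi-model integration: each follower proposes an action for the
        #    sub-goal; a simple majority vote picks the action to execute.
        votes = [f.predict(subgoal, observation) for f in follower_models]
        action = max(set(votes), key=votes.count)

        observation, done = env.step(action)  # assumed (obs, done) return
        path.append(action)
        if done:
            break
    return path
```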
Related papers
- DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control [7.626715427413578]
Vision-language-action (VLA) models have shown promise for generalizable robot skills. Current VLA models often focus on scaling the vision-language model (VLM) component, while the action space representation remains a critical bottleneck. This paper introduces DexVLA, a novel framework designed to enhance the efficiency and generalization capabilities of VLAs for complex, long-horizon tasks.
arXiv Detail & Related papers (2025-02-09T11:25:56Z)
- Vision Language Models are In-Context Value Learners [89.29486557646624]
We present Generative Value Learning (GVL), a universal value function estimator that leverages the world knowledge embedded in vision-language models (VLMs) to predict task progress.
Without any robot- or task-specific training, GVL can predict effective values in-context, zero-shot and few-shot, for more than 300 distinct real-world tasks.
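A rough sketch of such in-context value prediction, assuming a hypothetical multimodal chat callable `vlm_chat`; the prompt format and output parsing are illustrative only, not the GVL release:

```python
# Sketch of in-context task-progress estimation with an off-the-shelf VLM.
# `vlm_chat` is a hypothetical multimodal chat API accepting interleaved
# text and images and returning a text reply.

def estimate_progress(vlm_chat, task_description, frames, examples=()):
    """Return a progress value in [0, 1] per frame, predicted in-context."""
    prompt = [
        f"Task: {task_description}",
        "For each image, output a task-completion percentage from 0 to 100,"
        " separated by commas.",
    ]
    # Optional few-shot examples: (frame, ground-truth progress) pairs.
    for ex_frame, ex_value in examples:
        prompt += [ex_frame, f"Progress: {ex_value:.0f}"]
    prompt += list(frames)

    reply = vlm_chat(prompt)            # e.g. "20, 45, 80, 100"
    return [float(v) / 100.0 for v in reply.split(",")]
```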
arXiv Detail & Related papers (2024-11-07T09:17:50Z)
- Flex: End-to-End Text-Instructed Visual Navigation with Foundation Models [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies.
Our findings are synthesized in Flex (Fly-lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors.
We demonstrate the effectiveness of this approach on quadrotor fly-to-target tasks, where agents trained via behavior cloning successfully generalize to real-world scenes.
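A minimal sketch of this setup, assuming a hypothetical frozen encoder `frozen_vlm` that returns per-patch features; the head sizes and action dimension are illustrative, not the paper's exact architecture:

```python
# Frozen VLM vision tower as a patch-wise feature extractor; only a small
# policy head is trained (e.g. by behavior cloning). Shapes are assumptions.
import torch
import torch.nn as nn

class PatchFeaturePolicy(nn.Module):
    def __init__(self, frozen_vlm, patch_dim=768, num_patches=196, act_dim=4):
        super().__init__()
        self.encoder = frozen_vlm
        for p in self.encoder.parameters():   # keep the VLM frozen
            p.requires_grad = False
        self.head = nn.Sequential(            # small trainable policy head
            nn.Linear(patch_dim * num_patches, 512),
            nn.ReLU(),
            nn.Linear(512, act_dim),
        )

    def forward(self, images):
        with torch.no_grad():
            patches = self.encoder(images)    # assumed (B, num_patches, patch_dim)
        return self.head(patches.flatten(1))  # (B, act_dim) control command
```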
arXiv Detail & Related papers (2024-10-16T19:59:31Z)
- FLAME: Learning to Navigate with Multimodal LLM in Urban Environments [12.428873051106702]
Large Language Models (LLMs) have demonstrated potential in Vision-and-Language Navigation (VLN) tasks.
However, LLMs struggle with specialized navigation tasks, yielding suboptimal performance compared to dedicated VLN models.
We introduce FLAME, a novel Multimodal LLM-based agent and architecture designed for urban VLN tasks.
arXiv Detail & Related papers (2024-08-20T17:57:46Z)
- OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation [96.46961207887722]
OVER-NAV aims to go over and beyond the current state of the art in Iterative Vision-and-Language Navigation (IVLN).
To fully exploit the interpreted navigation data, we introduce a structured representation dubbed Omnigraph.
arXiv Detail & Related papers (2024-03-26T02:34:48Z)
- Towards Learning a Generalist Model for Embodied Navigation [24.816490551945435]
We propose the first generalist model for embodied navigation, NaviLLM.
It adapts LLMs to embodied navigation by introducing schema-based instruction.
We conduct extensive experiments to evaluate the performance and generalizability of our model.
arXiv Detail & Related papers (2023-12-04T16:32:51Z)
- Vision-Language Instruction Tuning: A Review and Analysis [52.218690619616474]
Vision-Language Instruction Tuning (VLIT) presents more complex characteristics compared to pure text instruction tuning.
We offer a detailed categorization for existing VLIT datasets and identify the characteristics that high-quality VLIT data should possess.
By incorporating these characteristics as guiding principles into the existing VLIT data construction process, we conduct extensive experiments and verify their positive impact on the performance of tuned multi-modal LLMs.
arXiv Detail & Related papers (2023-11-14T14:02:32Z)
- The Unreasonable Effectiveness of Large Language-Vision Models for Source-free Video Domain Adaptation [56.61543110071199]
The Source-Free Video Unsupervised Domain Adaptation (SFVUDA) task consists of adapting an action recognition model, trained on a labelled source dataset, to an unlabelled target dataset.
Previous approaches have attempted to address SFVUDA by leveraging self-supervision derived from the target data itself.
We take a different approach, exploiting "web-supervision" from Large Language-Vision Models (LLVMs), driven by the rationale that LLVMs contain a rich world prior that is surprisingly robust to domain shift.
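One way such web-supervision can be instantiated is zero-shot pseudo-labelling of unlabelled target frames with an off-the-shelf CLIP model; the model choice and prompt template below are assumptions for illustration, not the paper's exact pipeline:

```python
# Zero-shot pseudo-labelling of target video frames with CLIP, relying on its
# web-scale prior rather than on any source data.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def pseudo_label_frames(frames, class_names):
    """Return one predicted class index per frame (frames are PIL images)."""
    prompts = [f"a video of a person {c}" for c in class_names]  # assumed template
    inputs = processor(text=prompts, images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image   # (num_frames, num_classes)
    return logits.argmax(dim=-1).tolist()
```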
arXiv Detail & Related papers (2023-08-17T18:12:05Z)
- Effective Adaptation in Multi-Task Co-Training for Unified Autonomous Driving [103.745551954983]
In this paper, we investigate the transfer performance of various types of self-supervised methods, including MoCo and SimCLR, on three downstream tasks.
We find that their performance is sub-optimal or even lags far behind the single-task baseline.
We propose a simple yet effective pretrain-adapt-finetune paradigm for general multi-task training.
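A sketch of what such a pretrain-adapt-finetune schedule can look like, with hypothetical stage helpers and illustrative learning rates (not the paper's exact recipe):

```python
# Three-stage multi-task training schedule; freezing strategy, learning rates,
# and loss callables are illustrative assumptions.
import torch

def run_stage(modules, loader, loss_fn, lr, epochs=1):
    """Hypothetical helper: one training stage over the trainable parameters."""
    params = [p for m in modules for p in m.parameters() if p.requires_grad]
    opt = torch.optim.AdamW(params, lr=lr)
    for _ in range(epochs):
        for batch in loader:
            loss = loss_fn(batch)   # loss_fn is an assumed task-specific callable
            opt.zero_grad()
            loss.backward()
            opt.step()

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

def pretrain_adapt_finetune(backbone, heads, pretrain_loader, task_loader,
                            pretrain_loss, multitask_loss):
    # Stage 1: pretrain the shared backbone (e.g. a MoCo/SimCLR-style objective).
    run_stage([backbone], pretrain_loader, pretrain_loss, lr=1e-3)

    # Stage 2: adapt -- freeze the backbone, train only the task heads so they
    # align with the pretrained representation.
    set_trainable(backbone, False)
    run_stage(heads, task_loader, multitask_loss, lr=1e-3)

    # Stage 3: finetune backbone and all heads jointly at a lower learning rate.
    set_trainable(backbone, True)
    run_stage([backbone, *heads], task_loader, multitask_loss, lr=1e-4)
```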
arXiv Detail & Related papers (2022-09-19T12:15:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.