Explore the Potential Performance of Vision-and-Language Navigation Model: a Snapshot Ensemble Method
- URL: http://arxiv.org/abs/2111.14267v1
- Date: Sun, 28 Nov 2021 23:07:48 GMT
- Title: Explore the Potential Performance of Vision-and-Language Navigation Model: a Snapshot Ensemble Method
- Authors: Wenda Qin, Teruhisa Misu, Derry Wijaya
- Abstract summary: Vision-and-Language Navigation (VLN) is a challenging task in the field of artificial intelligence.
We provide a new perspective to improve VLN models.
- Score: 6.349841849317769
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Vision-and-Language Navigation (VLN) is a challenging task in the field of artificial intelligence. Although massive progress has been made on this task over the past few years, driven by breakthroughs in deep vision and language models, it remains difficult to build VLN models that generalize as well as humans. In this paper, we provide a new perspective on improving VLN models. Based on our discovery that snapshots of the same VLN model behave significantly differently even when their success rates are nearly identical, we propose a snapshot-based ensemble solution that leverages predictions from multiple snapshots. Built on snapshots of the existing state-of-the-art (SOTA) model $\circlearrowright$BERT and our past-action-aware modification of it, the proposed ensemble achieves new SOTA performance on the R2R challenge in Navigation Error (NE) and Success weighted by Path Length (SPL).
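For reference, SPL (Anderson et al., 2018) is defined as $\mathrm{SPL} = \frac{1}{N}\sum_{i=1}^{N} S_i\,\ell_i / \max(p_i, \ell_i)$, where $S_i$ indicates success on episode $i$, $\ell_i$ is the shortest-path distance to the goal, and $p_i$ is the length of the path actually taken.

The core idea, letting several snapshots of one agent decide each discrete navigation action together, can be shown in a minimal, self-contained sketch. All names below (ensemble_step, navigate, the six-way action space, the stand-in random policies) are illustrative assumptions; the actual ensemble is built on $\circlearrowright$BERT snapshots and may combine predictions differently, e.g. by averaging logits instead of majority voting.

from collections import Counter
from typing import Callable, Sequence
import random

NUM_ACTIONS = 6  # assumed discrete action space, e.g. turns, forward, stop

def ensemble_step(policies: Sequence[Callable], obs) -> int:
    """Return the action most snapshots agree on at this step."""
    votes = Counter(policy(obs) for policy in policies)
    return votes.most_common(1)[0][0]  # ties break by first-voted order

def navigate(policies, env_step, obs, max_steps=30, stop_action=0):
    """Roll out one episode, letting the snapshot ensemble vote each step."""
    trajectory = []
    for _ in range(max_steps):
        action = ensemble_step(policies, obs)
        trajectory.append(action)
        if action == stop_action:  # the ensemble decided to stop
            break
        obs = env_step(obs, action)
    return trajectory

if __name__ == "__main__":
    # Stand-in "snapshots": random policies seeded differently.
    policies = [lambda obs, r=random.Random(s): r.randrange(NUM_ACTIONS)
                for s in range(5)]
    print(navigate(policies, env_step=lambda obs, a: obs, obs=None))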
Related papers
- On the Status of Foundation Models for SAR Imagery [10.480790915352255]
We investigate the viability of foundational AI/ML models for Synthetic Aperture Radar (SAR) object recognition tasks.
We show that self-supervised fine-tuning of publicly available SSL models with SAR data is a viable path forward.
Our experiments further analyze the performance trade-off of using different backbones with different downstream task-adaptation recipes.
arXiv Detail & Related papers (2025-09-26T00:46:17Z)
- Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities [31.498539233768334]
We introduce VLN-PE, a physically realistic VLN platform supporting humanoid, quadruped, and wheeled robots.
For the first time, we evaluate several ego-centric VLN methods in physical robotic settings across different technical pipelines.
Our results reveal significant performance degradation due to limited robot observation space, environmental lighting variations, and physical challenges like collisions and falls.
arXiv Detail & Related papers (2025-07-17T11:46:00Z)
- VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning [77.34267241692706]
Vision-Language Navigation (VLN) is a core challenge in embodied AI, requiring agents to navigate real-world environments using natural language instructions.
We propose VLN-R1, an end-to-end framework that leverages Large Vision-Language Models (LVLM) to directly translate egocentric video streams into continuous navigation actions.
arXiv Detail & Related papers (2025-06-20T17:59:59Z)
- Enhanced Continual Learning of Vision-Language Models with Model Fusion [16.764069327701186]
Vision-Language Models (VLMs) represent a breakthrough in artificial intelligence.
VLMs are susceptible to catastrophic forgetting when sequentially fine-tuned on multiple downstream tasks.
We propose Continual Decoupling-Unifying (ConDU), a novel approach, by introducing model fusion into continual learning.
arXiv Detail & Related papers (2025-03-12T15:48:13Z)
- Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method [94.74003109176581]
Long-Horizon Vision-Language Navigation (LH-VLN) is a novel VLN task that emphasizes long-term planning and decision consistency across consecutive subtasks.
Our platform, benchmark and method supply LH-VLN with a robust data generation pipeline, comprehensive model evaluation dataset, reasonable metrics, and a novel VLN model.
arXiv Detail & Related papers (2024-12-12T09:08:13Z)
- ReVLA: Reverting Visual Domain Limitation of Robotic Foundation Models [55.07988373824348]
We study the visual generalization capabilities of three existing robotic foundation models.
Our study shows that the existing models do not exhibit robustness to visual out-of-domain scenarios.
We propose a gradual backbone reversal approach founded on model merging; a generic weight-space merging sketch follows this entry.
arXiv Detail & Related papers (2024-09-23T17:47:59Z)
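The ReVLA entry above names model merging as the basis of its gradual backbone reversal. The sketch below shows the generic weight-space merge (linear interpolation of matching parameters); this is the textbook operation rather than ReVLA's actual procedure, and merge_state_dicts plus the alpha sweep are assumptions for illustration.

import torch

def merge_state_dicts(sd_a: dict, sd_b: dict, alpha: float) -> dict:
    """Interpolate two checkpoints: alpha * sd_a + (1 - alpha) * sd_b."""
    assert sd_a.keys() == sd_b.keys(), "checkpoints must share an architecture"
    return {k: alpha * sd_a[k] + (1.0 - alpha) * sd_b[k] for k in sd_a}

if __name__ == "__main__":
    sd_a = {"w": torch.ones(2, 2)}   # e.g. a robotics-finetuned backbone
    sd_b = {"w": torch.zeros(2, 2)}  # e.g. the original pretrained backbone
    # A "gradual" reversal could sweep alpha back toward the original weights:
    for alpha in (0.75, 0.5, 0.25):
        merged = merge_state_dicts(sd_a, sd_b, alpha)
        print(alpha, merged["w"][0, 0].item())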
- Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z)
- Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z)
- PASTS: Progress-Aware Spatio-Temporal Transformer Speaker For Vision-and-Language Navigation [6.11362142120604]
Vision-and-language navigation (VLN) is a crucial but challenging cross-modal navigation task.
One powerful technique for enhancing VLN performance is using an independent speaker model to generate pseudo-instructions for data augmentation.
We propose a novel progress-aware spatio-temporal transformer speaker (PASTS) model that uses a transformer as the core of the network.
arXiv Detail & Related papers (2023-05-19T02:25:56Z)
- Waypoint Models for Instruction-guided Navigation in Continuous Environments [68.2912740006109]
We develop a class of language-conditioned waypoint prediction networks for instruction-guided navigation in continuous environments.
We measure task performance and estimated execution time on a profiled LoCoBot robot.
Our models outperform prior work in VLN-CE and set a new state-of-the-art on the public leaderboard.
arXiv Detail & Related papers (2021-10-05T17:55:49Z)
- Airbert: In-domain Pretraining for Vision-and-Language Navigation [91.03849833486974]
Vision-and-language navigation (VLN) aims to enable embodied agents to navigate in realistic environments using natural language instructions.
Recent methods explore pretraining to improve generalization of VLN agents.
We introduce BnB, a large-scale and diverse in-domain VLN dataset.
arXiv Detail & Related papers (2021-08-20T10:58:09Z)
- A Recurrent Vision-and-Language BERT for Navigation [54.059606864535304]
We propose a recurrent BERT model that is time-aware for use in vision-and-language navigation.
Our model can replace more complex encoder-decoder models to achieve state-of-the-art results.
arXiv Detail & Related papers (2020-11-26T00:23:00Z)