Seeing is Believing? Enhancing Vision-Language Navigation using Visual Perturbations
- URL: http://arxiv.org/abs/2409.05552v2
- Date: Mon, 07 Apr 2025 12:15:04 GMT
- Title: Seeing is Believing? Enhancing Vision-Language Navigation using Visual Perturbations
- Authors: Xuesong Zhang, Jia Li, Yunbo Xu, Zhenzhen Hu, Richang Hong
- Abstract summary: We investigate whether advanced VLN models genuinely comprehend the visual content of their environments. Surprisingly, we experimentally find that simple branch expansion, even with noisy visual inputs, paradoxically improves the navigational efficacy. We present a versatile Multi-Branch Architecture (MBA) designed to delve into the impact of both the branch quantity and visual quality.
- Score: 41.5875455113941
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Autonomous navigation guided by natural language instructions in embodied environments remains a challenge for vision-language navigation (VLN) agents. Although recent advancements in learning diverse and fine-grained visual environmental representations have shown promise, the fragile performance improvements may not be conclusively attributed to enhanced visual grounding, a limitation also observed in related vision-language tasks. In this work, we preliminarily investigate whether advanced VLN models genuinely comprehend the visual content of their environments by introducing varying levels of visual perturbations. These perturbations include ground-truth depth images, perturbed views, and random noise. Surprisingly, we experimentally find that simple branch expansion, even with noisy visual inputs, paradoxically improves navigational efficacy. Inspired by these insights, we further present a versatile Multi-Branch Architecture (MBA) designed to delve into the impact of both branch quantity and visual quality. The proposed MBA extends a base agent into a multi-branch variant, where each branch processes a different visual input. This approach is embarrassingly simple yet agnostic to topology-based VLN agents. Extensive experiments on three VLN benchmarks (R2R, REVERIE, SOON) demonstrate that our method with optimal visual permutations matches or even surpasses state-of-the-art results. The source code is available here.
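To make the multi-branch idea concrete, below is a minimal sketch of how a base agent could be replicated into branches, each consuming a different visual input (RGB features, depth features, perturbed or noisy features). The `make_base_agent` factory, the mean-fusion rule, and the perturbation helpers are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of a multi-branch VLN agent, assuming a hypothetical
# make_base_agent() factory that builds the underlying topology-based policy.
import torch
import torch.nn as nn


class MultiBranchAgent(nn.Module):
    """Wraps a base cross-modal policy with K branches, one per visual input."""

    def __init__(self, make_base_agent, num_branches: int = 3):
        super().__init__()
        # Each branch is an independent copy of the base agent.
        self.branches = nn.ModuleList(make_base_agent() for _ in range(num_branches))

    def forward(self, instr_tokens, visual_inputs):
        # visual_inputs: list of K feature tensors,
        # e.g. [rgb_feats, depth_feats, noisy_feats]
        assert len(visual_inputs) == len(self.branches)
        logits = [branch(instr_tokens, feats)
                  for branch, feats in zip(self.branches, visual_inputs)]
        # Fuse branch predictions; simple averaging is one plausible choice.
        return torch.stack(logits, dim=0).mean(dim=0)


def perturb_views(rgb_feats, noise_std: float = 0.1):
    """One way to build the 'perturbed views' and 'random noise' inputs
    discussed in the abstract (illustrative only)."""
    perturbed = rgb_feats + noise_std * torch.randn_like(rgb_feats)
    pure_noise = torch.randn_like(rgb_feats)
    return perturbed, pure_noise
```

In this sketch the base agent's architecture is left untouched; only the number of branches and the visual inputs fed to them vary, which mirrors the paper's claim that the approach is agnostic to the underlying topology-based agent.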
Related papers
- UnitedVLN: Generalizable Gaussian Splatting for Continuous Vision-Language Navigation [71.97405667493477]
We introduce a novel, generalizable 3DGS-based pre-training paradigm, called UnitedVLN.
It enables agents to better explore future environments by unitedly rendering high-fidelity 360° visual images and semantic features.
UnitedVLN outperforms state-of-the-art methods on existing VLN-CE benchmarks.
arXiv Detail & Related papers (2024-11-25T02:44:59Z)
- Fine-Grained Alignment in Vision-and-Language Navigation through Bayesian Optimization [20.608059199982094]
This paper addresses the challenge of fine-grained alignment in Vision-and-Language Navigation (VLN) tasks.
Current approaches use contrastive learning to align language with visual trajectory sequences.
We introduce a novel Bayesian Optimization-based adversarial optimization framework for creating fine-grained contrastive vision samples.
arXiv Detail & Related papers (2024-11-22T09:12:02Z)
- Multi-Granularity Language-Guided Multi-Object Tracking [95.91263758294154]
We propose a new multi-object tracking framework, named LG-MOT, that explicitly leverages language information at different levels of granularity.
At inference, our LG-MOT uses the standard visual features without relying on annotated language descriptions.
Our LG-MOT achieves an absolute gain of 2.2% in terms of target object association (IDF1 score) compared to the baseline using only visual features.
arXiv Detail & Related papers (2024-06-07T11:18:40Z)
- Divert More Attention to Vision-Language Object Tracking [87.31882921111048]
We argue that the lack of large-scale vision-language annotated videos and ineffective vision-language interaction learning motivate the design of a more effective vision-language representation for tracking.
Particularly, in this paper, we first propose a general attribute annotation strategy to decorate videos in six popular tracking benchmarks, which contributes a large-scale vision-language tracking database with more than 23,000 videos.
We then introduce a novel framework to improve tracking by learning a unified-adaptive VL representation, whose core components are the proposed asymmetric architecture search and modality mixer (ModaMixer).
arXiv Detail & Related papers (2023-07-19T15:22:06Z)
- VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View [81.58612867186633]
Vision and Language Navigation (VLN) requires visual and natural language understanding as well as spatial and temporal reasoning capabilities.
We show that VELMA is able to successfully follow navigation instructions in Street View with only two in-context examples.
We further finetune the LLM agent on a few thousand examples and achieve 25%-30% relative improvement in task completion over the previous state-of-the-art for two datasets.
arXiv Detail & Related papers (2023-07-12T11:08:24Z)
- GeoVLN: Learning Geometry-Enhanced Visual Representation with Slot Attention for Vision-and-Language Navigation [52.65506307440127]
We propose GeoVLN, which learns Geometry-enhanced visual representation based on slot attention for robust Visual-and-Language Navigation.
We employ V&L BERT to learn a cross-modal representation that incorporates both language and vision information.
arXiv Detail & Related papers (2023-05-26T17:15:22Z)
- TriPINet: Tripartite Progressive Integration Network for Image Manipulation Localization [3.7359400978194675]
We propose a tripartite progressive integration network (TriPINet) for end-to-end image manipulation localization.
We develop a guided cross-modality dual-attention (gCMDA) module to fuse different types of forged clues.
Extensive experiments are conducted to compare our method with state-of-the-art image forensics approaches.
arXiv Detail & Related papers (2022-12-25T02:27:58Z)
- Diagnosing Vision-and-Language Navigation: What Really Matters [61.72935815656582]
Vision-and-language navigation (VLN) is a multimodal task where an agent follows natural language instructions and navigates in visual environments.
Recent studies have witnessed a slow-down in performance improvements on both indoor and outdoor VLN tasks.
In this work, we conduct a series of diagnostic experiments to unveil agents' focus during navigation.
arXiv Detail & Related papers (2021-03-30T17:59:07Z)
- Unsupervised Multimodal Neural Machine Translation with Pseudo Visual Pivoting [105.5303416210736]
Unsupervised machine translation (MT) has recently achieved impressive results with monolingual corpora only.
It is still challenging to associate source-target sentences in the latent space.
As people speaking different languages biologically share similar visual systems, the potential for achieving better alignment through visual content is promising.
arXiv Detail & Related papers (2020-05-06T20:11:46Z)
- Diagnosing the Environment Bias in Vision-and-Language Navigation [102.02103792590076]
Vision-and-Language Navigation (VLN) requires an agent to follow natural-language instructions, explore the given environments, and reach the desired target locations.
Recent works that study VLN observe a significant performance drop when tested on unseen environments, indicating that the neural agent models are highly biased towards training environments.
In this work, we design novel diagnosis experiments via environment re-splitting and feature replacement, looking into possible reasons for this environment bias.
arXiv Detail & Related papers (2020-05-06T19:24:33Z)
- Counterfactual Vision-and-Language Navigation via Adversarial Path Sampling [65.99956848461915]
Vision-and-Language Navigation (VLN) is a task where agents must decide how to move through a 3D environment to reach a goal.
One of the problems of the VLN task is data scarcity since it is difficult to collect enough navigation paths with human-annotated instructions for interactive environments.
We propose an adversarial-driven counterfactual reasoning model that can consider effective conditions instead of low-quality augmented data.
arXiv Detail & Related papers (2019-11-17T18:02:51Z)