Related papers: \textsc{NaVIDA}: Vision-Language Navigation with Inverse Dynamics Augmentation

URL: http://arxiv.org/abs/2601.18188v1
Date: Mon, 26 Jan 2026 06:16:17 GMT
Title: \textsc{NaVIDA}: Vision-Language Navigation with Inverse Dynamics Augmentation
Authors: Weiye Zhu, Zekai Zhang, Xiangchen Wang, Hewei Pan, Teng Wang, Tiantian Geng, Rongtao Xu, Feng Zheng
Abstract summary: \textsc{NaVIDA} is a unified VLN framework that couples policy learning with action-grounded visual dynamics and adaptive execution. \textsc{NaVIDA} augments training with chunk-based inverse-dynamics supervision to learn the causal relationship between visual changes and corresponding actions. Experiments show that \textsc{NaVIDA} achieves superior navigation performance compared to state-of-the-art methods with fewer parameters (3B vs. 8B).
Score: 50.027425808733994
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-and-Language Navigation (VLN) requires agents to interpret natural language instructions and act coherently in visually rich environments. However, most existing methods rely on reactive state-action mappings without explicitly modeling how actions causally transform subsequent visual observations. Lacking such vision-action causality, agents cannot anticipate the visual changes induced by their own actions, leading to unstable behaviors, weak generalization, and cumulative error along the trajectory. To address these issues, we introduce \textsc{NaVIDA} (\textbf{Nav}igation with \textbf{I}nverse \textbf{D}ynamics \textbf{A}ugmentation), a unified VLN framework that couples policy learning with action-grounded visual dynamics and adaptive execution. \textsc{NaVIDA} augments training with chunk-based inverse-dynamics supervision to learn the causal relationship between visual changes and corresponding actions. To structure this supervision and extend the effective planning range, \textsc{NaVIDA} employs hierarchical probabilistic action chunking (HPAC), which organizes trajectories into multi-step chunks and provides discriminative, longer-range visual-change cues. To further curb error accumulation and stabilize behavior at inference, an entropy-guided mechanism adaptively sets the execution horizon of action chunks. Extensive experiments show that \textsc{NaVIDA} achieves superior navigation performance compared to state-of-the-art methods with fewer parameters (3B vs. 8B). Real-world robot evaluations further validate the practical feasibility and effectiveness of our approach. Code and data will be available upon acceptance.
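
Illustration (not from the paper): the abstract says an entropy-guided mechanism sets the execution horizon of action chunks but does not give the rule. The sketch below is one plausible reading, under stated assumptions: the policy emits a categorical distribution over discrete actions for each step of a chunk, and the agent executes leading steps until a step's entropy exceeds a fixed fraction of the maximum possible entropy. The names `entropy`, `execution_horizon`, and `max_entropy_frac` are hypothetical and do not come from the authors' code.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def execution_horizon(
    chunk_probs: List[Sequence[float]],
    max_entropy_frac: float = 0.5,  # assumed threshold, not from the paper
) -> int:
    """Return how many leading actions of a predicted chunk to execute.

    The first action is always executed. Each later step is executed only
    while its entropy stays at or below max_entropy_frac * log(num_actions);
    the first step above that threshold ends execution and triggers replanning.
    """
    horizon = 0
    for probs in chunk_probs:
        threshold = max_entropy_frac * math.log(len(probs))
        if horizon > 0 and entropy(probs) > threshold:
            break
        horizon += 1
    return horizon


# Hypothetical 4-step chunk over 4 discrete actions.
chunk = [
    [0.97, 0.01, 0.01, 0.01],  # confident
    [0.90, 0.05, 0.03, 0.02],  # still fairly confident
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain: stop before this step
    [0.97, 0.01, 0.01, 0.01],
]
steps_to_execute = execution_horizon(chunk)
```

With this rule, a confident chunk is executed further before the next forward pass, while an uncertain prediction forces earlier replanning, which matches the abstract's stated goal of curbing error accumulation.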

Related papers

SCALE: Self-uncertainty Conditioned Adaptive Looking and Execution for Vision-Language-Action Models [21.133970394496327]
Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robotic control. Current test-time scaling (TTS) methods require additional training, verifiers, and multiple forward passes, making them impractical for deployment. We propose a simple inference strategy that jointly modulates visual perception and action based on 'self-uncertainty'.
arXiv Detail & Related papers (2026-02-04T04:48:16Z)
ReViP: Reducing False Completion in Vision-Language-Action Models with Vision-Proprioception Rebalance [50.05984919728878]
We present ReViP, a novel VLA framework with Vision-Proprioception Rebalance to enhance visual grounding and robustness under perturbations. Specifically, we use an external VLM as a task-stage observer to extract real-time task-centric visual cues from visual observations. To evaluate false completion, we propose the first False-Completion Benchmark Suite built on LIBERO with controlled settings such as Object-Drop.
arXiv Detail & Related papers (2026-01-23T11:31:07Z)
Stable Language Guidance for Vision-Language-Action Models [62.80963701282789]
Residual Semantic Steering (RSS) is a probabilistic framework that disentangles physical affordance from semantic execution. RSS achieves state-of-the-art robustness, maintaining performance even under adversarial linguistic perturbations.
arXiv Detail & Related papers (2026-01-07T16:16:10Z)
Learning to Act Robustly with View-Invariant Latent Actions [8.446887947386559]
Vision-based robotic policies often struggle with even minor viewpoint changes, underscoring the need for view-invariant visual representations. We propose View-Invariant Latent Action (VILA), which models a latent action capturing transition patterns across trajectories to learn view-invariant representations grounded in physical dynamics. Experiments in both simulation and the real world show that VILA-based policies generalize effectively to unseen viewpoints and transfer well to new tasks.
arXiv Detail & Related papers (2026-01-06T13:14:01Z)
PosA-VLA: Enhancing Action Generation via Pose-Conditioned Anchor Attention [92.85371254435074]
The PosA-VLA framework anchors visual attention via pose-conditioned supervision, consistently guiding the model's perception toward task-relevant regions. We show that our method executes embodied tasks with precise and time-efficient behavior across diverse robotic manipulation benchmarks.
arXiv Detail & Related papers (2025-12-03T12:14:29Z)
Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy [73.75271615101754]
We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences. Dita employs in-context conditioning, enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations. Dita effectively integrates cross-embodiment datasets across diverse camera perspectives, observation scenes, tasks, and action spaces.
arXiv Detail & Related papers (2025-03-25T15:19:56Z)
Fine-Grained Alignment in Vision-and-Language Navigation through Bayesian Optimization [20.608059199982094]
This paper addresses the challenge of fine-grained alignment in Vision-and-Language Navigation (VLN) tasks. Current approaches use contrastive learning to align language with visual trajectory sequences. We introduce a novel Bayesian Optimization-based adversarial optimization framework for creating fine-grained contrastive vision samples.
arXiv Detail & Related papers (2024-11-22T09:12:02Z)
Causality-Aware Transformer Networks for Robotic Navigation [13.719643934968367]
Current research in Visual Navigation reveals opportunities for improvement. Direct adoption of RNNs and Transformers often overlooks the specific differences between Embodied AI and traditional sequential data modelling. We propose Causality-Aware Transformer (CAT) Networks for Navigation, featuring a Causal Understanding Module.
arXiv Detail & Related papers (2024-09-04T12:53:26Z)
Narrowing the Gap between Vision and Action in Navigation [28.753809306008996]
We introduce a low-level action decoder jointly trained with high-level action prediction. Our agent can improve navigation performance metrics compared to the strong baselines on both high-level and low-level actions.
arXiv Detail & Related papers (2024-08-19T20:09:56Z)
Generalization in Visual Reinforcement Learning with the Reward Sequence Distribution [98.67737684075587]
Generalization in partially observed Markov decision processes (POMDPs) is critical for successful applications of visual reinforcement learning (VRL). We propose the reward sequence distribution conditioned on the starting observation and the predefined subsequent action sequence (RSD-OA). Experiments demonstrate that our representation learning approach based on RSD-OA significantly improves the generalization performance on unseen environments.
arXiv Detail & Related papers (2023-02-19T15:47:24Z)
Counterfactual Vision-and-Language Navigation via Adversarial Path Sampling [65.99956848461915]
Vision-and-Language Navigation (VLN) is a task where agents must decide how to move through a 3D environment to reach a goal. One of the problems of the VLN task is data scarcity, since it is difficult to collect enough navigation paths with human-annotated instructions for interactive environments. We propose an adversarial-driven counterfactual reasoning model that can consider effective conditions instead of low-quality augmented data.
arXiv Detail & Related papers (2019-11-17T18:02:51Z)

This list is automatically generated from the titles and abstracts of the papers on this site.