VLNVerse: A Benchmark for Vision-Language Navigation with Versatile, Embodied, Realistic Simulation and Evaluation
- URL: http://arxiv.org/abs/2512.19021v1
- Date: Mon, 22 Dec 2025 04:27:26 GMT
- Title: VLNVerse: A Benchmark for Vision-Language Navigation with Versatile, Embodied, Realistic Simulation and Evaluation
- Authors: Sihao Lin, Zerui Li, Xunyi Zhao, Gengze Zhou, Liuyi Wang, Rong Wei, Rui Tang, Juncheng Li, Hanqing Wang, Jiangmiao Pang, Anton van den Hengel, Jiajun Liu, Qi Wu
- Abstract summary: We introduce VLNVerse, a benchmark for Versatile, Embodied, Realistic Simulation, and Evaluation. VLNVerse redefines VLN as a scalable, full-stack embodied AI problem. We propose a novel unified multi-task model capable of addressing all tasks within the benchmark.
- Score: 61.82502719679122
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite remarkable progress in Vision-Language Navigation (VLN), existing benchmarks remain confined to fixed, small-scale datasets with naive physical simulation. These shortcomings limit the insight the benchmarks provide into sim-to-real generalization and create a significant research gap. Furthermore, task fragmentation prevents unified, shared progress in the area, while limited data scales fail to meet the demands of modern LLM-based pretraining. To overcome these limitations, we introduce VLNVerse: a new large-scale, extensible benchmark designed for Versatile, Embodied, Realistic Simulation, and Evaluation. VLNVerse redefines VLN as a scalable, full-stack embodied AI problem. Its Versatile nature unifies previously fragmented tasks into a single framework and provides an extensible toolkit for researchers. Its Embodied design moves beyond intangible, teleporting "ghost" agents, supporting full-kinematics agents in a Realistic Simulation powered by a robust physics engine. We leverage the scale and diversity of VLNVerse to conduct a comprehensive Evaluation of existing methods, from classic models to MLLM-based agents. We also propose a novel unified multi-task model capable of addressing all tasks within the benchmark. VLNVerse aims to narrow the gap between simulated navigation and real-world generalization, providing the community with a vital tool to advance research toward scalable, general-purpose embodied locomotion agents.
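The abstract refers to a comprehensive Evaluation of navigation methods without detailing the metrics; as a general illustration only (not VLNVerse's actual toolkit or API), the sketch below shows the Success Rate (SR) and Success weighted by Path Length (SPL) metrics most VLN benchmarks report. The `Episode` fields and the 3 m success radius are illustrative assumptions.

```python
# Minimal sketch of standard VLN metrics (SR and SPL); not VLNVerse's actual API.
from dataclasses import dataclass

@dataclass
class Episode:
    shortest_path_length: float  # geodesic distance from start to goal (meters)
    agent_path_length: float     # distance the agent actually traveled (meters)
    final_goal_distance: float   # agent's distance to goal at episode end (meters)

def evaluate(episodes, success_radius: float = 3.0):
    """Compute Success Rate and SPL over a list of episodes."""
    sr_total, spl_total = 0.0, 0.0
    for ep in episodes:
        success = ep.final_goal_distance <= success_radius
        sr_total += float(success)
        # SPL weights success by path efficiency (Anderson et al., 2018):
        # S_i * l_i / max(p_i, l_i)
        spl_total += float(success) * ep.shortest_path_length / max(
            ep.agent_path_length, ep.shortest_path_length
        )
    n = len(episodes)
    return {"SR": sr_total / n, "SPL": spl_total / n}

# Example usage with two hypothetical episodes:
print(evaluate([Episode(10.0, 12.0, 1.5), Episode(8.0, 20.0, 5.0)]))
```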
Related papers
- Self-Correcting VLA: Online Action Refinement via Sparse World Imagination [55.982504915794514]
We propose Self-Correcting VLA (SC-VLA), which achieves self-improvement by intrinsically guiding action refinement through sparse imagination. SC-VLA achieves state-of-the-art performance, yielding the highest task throughput with 16% fewer steps and a 9% higher success rate than the best-performing baselines.
arXiv Detail & Related papers (2026-02-25T06:58:06Z) - SimVLA: A Simple VLA Baseline for Robotic Manipulation [46.38114519538192]
Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robotic manipulation. We introduce SimVLA, a streamlined baseline designed to establish a transparent reference point for VLA research.
arXiv Detail & Related papers (2026-02-20T14:04:27Z) - Can LLMs See Without Pixels? Benchmarking Spatial Intelligence from Textual Descriptions [18.455501447828343]
Benchmarking Spatial Intelligence (SI) has predominantly relied on Vision-Language Models (VLMs). We introduce SiT-Bench, a novel benchmark designed to evaluate the SI performance of Large Language Models (LLMs) without pixel-level input. We find that explicit spatial reasoning significantly boosts performance, suggesting that LLMs possess latent world-modeling potential.
arXiv Detail & Related papers (2026-01-07T05:13:52Z) - LTD-Bench: Evaluating Large Language Models by Letting Them Draw [57.237152905238084]
LTD-Bench is a breakthrough benchmark for large language models (LLMs). It transforms LLM evaluation from abstract scores to directly observable visual outputs by requiring models to generate drawings through dot matrices or executable code. LTD-Bench's visual outputs enable powerful diagnostic analysis, offering a potential approach to investigate model similarity.
arXiv Detail & Related papers (2025-11-04T08:11:23Z) - Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting [70.83781268763215]
Vision-language models (VLMs) have achieved impressive performance across diverse multimodal tasks by leveraging large-scale pre-training. In continual learning settings, VLMs face unique challenges such as cross-modal feature drift, parameter interference due to shared architectures, and zero-shot capability erosion. This survey aims to serve as a comprehensive and diagnostic reference for researchers developing lifelong vision-language systems.
arXiv Detail & Related papers (2025-08-06T09:03:10Z) - Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities [56.68006021126077]
We introduce VLN-PE, a physically realistic VLN platform supporting humanoid, quadruped, and wheeled robots. For the first time, we evaluate several ego-centric VLN methods in physical robotic settings across different technical pipelines. Our results reveal significant performance degradation due to limited robot observation space, environmental lighting variations, and physical challenges like collisions and falls.
arXiv Detail & Related papers (2025-07-17T11:46:00Z) - HA-VLN 2.0: An Open Benchmark and Leaderboard for Human-Aware Navigation in Discrete and Continuous Environments with Dynamic Multi-Human Interactions [64.69468932145234]
We present HA-VLN 2.0, a unified benchmark introducing explicit social-awareness constraints. Results show that explicit social modeling improves navigation robustness and reduces collisions.
arXiv Detail & Related papers (2025-03-18T13:05:55Z) - AdaVLN: Towards Visual Language Navigation in Continuous Indoor Environments with Moving Humans [2.940962519388297]
We propose an extension to the task, termed Adaptive Visual Language Navigation (AdaVLN). AdaVLN requires robots to navigate complex 3D indoor environments populated with dynamically moving human obstacles. We evaluate several baseline models on this task, analyze the unique challenges introduced by AdaVLN, and demonstrate its potential to bridge the sim-to-real gap in VLN research.
arXiv Detail & Related papers (2024-11-27T17:36:08Z) - Efficient High-Resolution Visual Representation Learning with State Space Model for Human Pose Estimation [60.80423207808076]
Capturing long-range dependencies while preserving high-resolution visual representations is crucial for dense prediction tasks such as human pose estimation. We propose the Dynamic Visual State Space (DVSS) block, which augments visual state space models with multi-scale convolutional operations. We build HRVMamba, a novel model for efficient high-resolution representation learning.
arXiv Detail & Related papers (2024-10-04T06:19:29Z) - LocoMuJoCo: A Comprehensive Imitation Learning Benchmark for Locomotion [20.545058017790428]
Imitation Learning (IL) holds great promise for enabling agile locomotion in embodied agents.
We present a novel benchmark designed to facilitate rigorous evaluation and comparison of IL algorithms.
This benchmark encompasses a diverse set of environments, including quadrupeds, bipeds, and musculoskeletal human models.
arXiv Detail & Related papers (2023-11-04T19:41:50Z)