VLNVerse: A Benchmark for Vision-Language Navigation with Versatile, Embodied, Realistic Simulation and Evaluation
- URL: http://arxiv.org/abs/2512.19021v1
- Date: Mon, 22 Dec 2025 04:27:26 GMT
- Title: VLNVerse: A Benchmark for Vision-Language Navigation with Versatile, Embodied, Realistic Simulation and Evaluation
- Authors: Sihao Lin, Zerui Li, Xunyi Zhao, Gengze Zhou, Liuyi Wang, Rong Wei, Rui Tang, Juncheng Li, Hanqing Wang, Jiangmiao Pang, Anton van den Hengel, Jiajun Liu, Qi Wu
- Abstract summary: We introduce VLNVerse, a benchmark for Versatile, Embodied, Realistic Simulation, and Evaluation. VLNVerse redefines VLN as a scalable, full-stack embodied AI problem. We propose a novel unified multi-task model capable of addressing all tasks within the benchmark.
- Score: 61.82502719679122
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite remarkable progress in Vision-Language Navigation (VLN), existing benchmarks remain confined to fixed, small-scale datasets with naive physical simulation. These shortcomings limit the insight the benchmarks provide into sim-to-real generalization and create a significant research gap. Furthermore, task fragmentation prevents unified, shared progress in the area, while limited data scales fail to meet the demands of modern LLM-based pretraining. To overcome these limitations, we introduce VLNVerse: a new large-scale, extensible benchmark designed for Versatile, Embodied, Realistic Simulation, and Evaluation. VLNVerse redefines VLN as a scalable, full-stack embodied AI problem. Its Versatile nature unifies previously fragmented tasks into a single framework and provides an extensible toolkit for researchers. Its Embodied design moves beyond intangible, teleporting "ghost" agents, supporting full-kinematics agents in a Realistic Simulation powered by a robust physics engine. We leverage the scale and diversity of VLNVerse to conduct a comprehensive Evaluation of existing methods, from classic models to MLLM-based agents. We also propose a novel unified multi-task model capable of addressing all tasks within the benchmark. VLNVerse aims to narrow the gap between simulated navigation and real-world generalization, providing the community with a vital tool to advance research toward scalable, general-purpose embodied locomotion agents.
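The abstract refers to a comprehensive Evaluation of navigation methods without detailing the metrics; as a general illustration only (not VLNVerse's actual toolkit or API), the sketch below shows the Success Rate (SR) and Success weighted by Path Length (SPL) metrics most VLN benchmarks report. The `Episode` fields and the 3 m success radius are illustrative assumptions.

```python
# Minimal sketch of standard VLN metrics (SR and SPL); not VLNVerse's actual API.
from dataclasses import dataclass

@dataclass
class Episode:
    shortest_path_length: float  # geodesic distance from start to goal (meters)
    agent_path_length: float     # distance the agent actually traveled (meters)
    final_goal_distance: float   # agent's distance to goal at episode end (meters)

def evaluate(episodes, success_radius: float = 3.0):
    """Compute Success Rate and SPL over a list of episodes."""
    sr_total, spl_total = 0.0, 0.0
    for ep in episodes:
        success = ep.final_goal_distance <= success_radius
        sr_total += float(success)
        # SPL weights success by path efficiency (Anderson et al., 2018):
        # S_i * l_i / max(p_i, l_i)
        spl_total += float(success) * ep.shortest_path_length / max(
            ep.agent_path_length, ep.shortest_path_length
        )
    n = len(episodes)
    return {"SR": sr_total / n, "SPL": spl_total / n}

# Example usage with two hypothetical episodes:
print(evaluate([Episode(10.0, 12.0, 1.5), Episode(8.0, 20.0, 5.0)]))
```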
Related papers
- Self-Correcting VLA: Online Action Refinement via Sparse World Imagination [55.982504915794514]
We propose Self-Correcting VLA (SC-VLA), which achieves self-improvement by intrinsically guiding action refinement through sparse imagination. SC-VLA achieves state-of-the-art performance, yielding the highest task throughput with 16% fewer steps and a 9% higher success rate than the best-performing baselines.
arXiv Detail & Related papers (2026-02-25T06:58:06Z) - SimVLA: A Simple VLA Baseline for Robotic Manipulation [46.38114519538192]
Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robotic manipulation. We introduce SimVLA, a streamlined baseline designed to establish a transparent reference point for VLA research.
arXiv Detail & Related papers (2026-02-20T14:04:27Z) - Can LLMs See Without Pixels? Benchmarking Spatial Intelligence from Textual Descriptions [18.455501447828343]
Benchmarking Spatial Intelligence (SI) has predominantly relied on Vision-Language Models (VLMs). We introduce SiT-Bench, a novel benchmark designed to evaluate the SI performance of Large Language Models (LLMs) without pixel-level input. We find that explicit spatial reasoning significantly boosts performance, suggesting that LLMs possess latent world-modeling potential.
arXiv Detail & Related papers (2026-01-07T05:13:52Z) - LTD-Bench: Evaluating Large Language Models by Letting Them Draw [57.237152905238084]
LTD-Bench is a breakthrough benchmark for large language models (LLMs). It transforms LLM evaluation from abstract scores to directly observable visual outputs by requiring models to generate drawings through dot matrices or executable code. LTD-Bench's visual outputs enable powerful diagnostic analysis, offering a potential approach to investigate model similarity.
arXiv Detail & Related papers (2025-11-04T08:11:23Z) - Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting [70.83781268763215]
Vision-language models (VLMs) have achieved impressive performance across diverse multimodal tasks by leveraging large-scale pre-training. In continual learning settings, VLMs face unique challenges such as cross-modal feature drift, parameter interference due to shared architectures, and zero-shot capability erosion. This survey aims to serve as a comprehensive and diagnostic reference for researchers developing lifelong vision-language systems.
arXiv Detail & Related papers (2025-08-06T09:03:10Z) - Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities [56.68006021126077]
We introduce VLN-PE, a physically realistic VLN platform supporting humanoid, quadruped, and wheeled robots. For the first time, we evaluate several ego-centric VLN methods in physical robotic settings across different technical pipelines. Our results reveal significant performance degradation due to limited robot observation space, environmental lighting variations, and physical challenges like collisions and falls.
arXiv Detail & Related papers (2025-07-17T11:46:00Z) - HA-VLN 2.0: An Open Benchmark and Leaderboard for Human-Aware Navigation in Discrete and Continuous Environments with Dynamic Multi-Human Interactions [64.69468932145234]
We present HA-VLN 2.0, a unified benchmark introducing explicit social-awareness constraints. Results show that explicit social modeling improves navigation robustness and reduces collisions.
arXiv Detail & Related papers (2025-03-18T13:05:55Z) - AdaVLN: Towards Visual Language Navigation in Continuous Indoor Environments with Moving Humans [2.940962519388297]
We propose an extension to the task, termed Adaptive Visual Language Navigation (AdaVLN). AdaVLN requires robots to navigate complex 3D indoor environments populated with dynamically moving human obstacles. We evaluate several baseline models on this task, analyze the unique challenges introduced by AdaVLN, and demonstrate its potential to bridge the sim-to-real gap in VLN research.
arXiv Detail & Related papers (2024-11-27T17:36:08Z) - Efficient High-Resolution Visual Representation Learning with State Space Model for Human Pose Estimation [60.80423207808076]
Capturing long-range dependencies while preserving high-resolution visual representations is crucial for dense prediction tasks such as human pose estimation. We propose the Dynamic Visual State Space (DVSS) block, which augments visual state space models with multi-scale convolutional operations. We build HRVMamba, a novel model for efficient high-resolution representation learning.
arXiv Detail & Related papers (2024-10-04T06:19:29Z) - LocoMuJoCo: A Comprehensive Imitation Learning Benchmark for Locomotion [20.545058017790428]
Imitation Learning (IL) holds great promise for enabling agile locomotion in embodied agents.
We present a novel benchmark designed to facilitate rigorous evaluation and comparison of IL algorithms.
This benchmark encompasses a diverse set of environments, including quadrupeds, bipeds, and musculoskeletal human models.
arXiv Detail & Related papers (2023-11-04T19:41:50Z)