LangNavBench: Evaluation of Natural Language Understanding in Semantic Navigation
- URL: http://arxiv.org/abs/2507.07299v1
- Date: Wed, 09 Jul 2025 21:46:43 GMT
- Title: LangNavBench: Evaluation of Natural Language Understanding in Semantic Navigation
- Authors: Sonia Raychaudhuri, Enrico Cancelli, Tommaso Campari, Lamberto Ballan, Manolis Savva, Angel X. Chang
- Abstract summary: LangNav is an open-set dataset specifically created to test an agent's ability to locate objects described at different levels of detail. LangNavBench allows us to systematically compare models on their handling of attributes, spatial and relational cues, and category hierarchies. MLFM is a method that builds a queryable multi-layered semantic map.
- Score: 18.951580080771432
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent progress in large vision-language models has driven improvements in language-based semantic navigation, where an embodied agent must reach a target object described in natural language. Despite these advances, we still lack a clear, language-focused benchmark for testing how well such agents ground the words in their instructions. We address this gap with LangNav, an open-set dataset specifically created to test an agent's ability to locate objects described at different levels of detail, from broad category names to fine attributes and object-object relations. Every description in LangNav was manually checked, yielding a lower error rate than existing lifelong- and semantic-navigation datasets. On top of LangNav we build LangNavBench, a benchmark that measures how well current semantic-navigation methods understand and act on these descriptions while moving toward their targets. LangNavBench allows us to systematically compare models on their handling of attributes, spatial and relational cues, and category hierarchies, offering the first thorough, language-centric evaluation of embodied navigation systems. We also present Multi-Layered Feature Map (MLFM), a method that builds a queryable multi-layered semantic map, particularly effective when dealing with small objects or instructions involving spatial relations. MLFM outperforms state-of-the-art mapping-based navigation baselines on the LangNav dataset.
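As a rough illustration of the queryable multi-layered map idea behind MLFM, the sketch below keeps one top-down feature grid per layer and retrieves the best-matching cell for a text embedding. The layer names, grid resolution, running-mean fusion rule, and the assumption of a CLIP-like shared text/visual embedding space are illustrative choices only, not MLFM's actual implementation.

```python
# Minimal sketch of a queryable multi-layered semantic map (hypothetical,
# not MLFM's implementation). Assumes observation and query embeddings live
# in a shared space produced by a CLIP-like encoder.
import numpy as np

class MultiLayerSemanticMap:
    """Top-down feature grids at several layers (e.g. large furniture, small objects)."""

    def __init__(self, layers, grid_size=128, feat_dim=512):
        # One feature grid per layer; each cell holds the running mean of the
        # observation embeddings projected into it.
        self.grids = {name: np.zeros((grid_size, grid_size, feat_dim)) for name in layers}
        self.counts = {name: np.zeros((grid_size, grid_size, 1)) for name in layers}

    def update(self, layer, cell, feature):
        """Fuse one observation embedding into a map cell by running mean."""
        i, j = cell
        c = self.counts[layer][i, j] + 1
        self.grids[layer][i, j] += (feature - self.grids[layer][i, j]) / c
        self.counts[layer][i, j] = c

    def query(self, text_embedding):
        """Return the (layer, cell) whose stored feature best matches the description."""
        best, best_score = None, -np.inf
        for name, grid in self.grids.items():
            norms = np.linalg.norm(grid, axis=-1) * np.linalg.norm(text_embedding) + 1e-8
            scores = grid @ text_embedding / norms  # cosine similarity per cell
            cell = np.unravel_index(np.argmax(scores), scores.shape)
            if scores[cell] > best_score:
                best, best_score = (name, cell), scores[cell]
        return best, best_score
```

Keeping separate layers is what lets a query like "the mug on the counter" be matched against a fine-grained small-object grid rather than being washed out by coarse room-level features.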
Related papers
- NavComposer: Composing Language Instructions for Navigation Trajectories through Action-Scene-Object Modularization [17.525269369227786]
We propose NavComposer, a framework for automatically generating high-quality navigation instructions. NavComposer explicitly decomposes semantic entities such as actions, scenes, and objects, and recomposes them into natural language instructions. It operates in a data-agnostic manner, supporting adaptation to diverse navigation trajectories without domain-specific training. NavInstrCritic provides a holistic evaluation of instruction quality, addressing limitations of traditional metrics that rely heavily on expert annotations.
arXiv Detail & Related papers (2025-07-15T01:20:22Z)
- NavRAG: Generating User Demand Instructions for Embodied Navigation through Retrieval-Augmented LLM [55.79954652783797]
Vision-and-Language Navigation (VLN) is an essential skill for embodied agents, allowing them to navigate in 3D environments following natural language instructions. Previous methods translate trajectory videos into step-by-step instructions for expanding data, but such instructions do not match well with users' communication styles. We propose NavRAG, a retrieval-augmented generation framework that generates user demand instructions for VLN.
arXiv Detail & Related papers (2025-02-16T14:17:36Z)
- TopV-Nav: Unlocking the Top-View Spatial Reasoning Potential of MLLM for Zero-shot Object Navigation [52.422619828854984]
We introduce TopV-Nav, an MLLM-based method that reasons directly on a top-view map with sufficient spatial information. To fully unlock the MLLM's spatial reasoning potential in the top-view perspective, we propose the Adaptive Visual Prompt Generation (AVPG) method.
arXiv Detail & Related papers (2024-11-25T14:27:55Z)
- MC-GPT: Empowering Vision-and-Language Navigation with Memory Map and Reasoning Chains [4.941781282578696]
In the Vision-and-Language Navigation (VLN) task, the agent is required to navigate to a destination following a natural language instruction.
While learning-based approaches have been a major solution to the task, they suffer from high training costs and lack of interpretability.
Recently, Large Language Models (LLMs) have emerged as a promising tool for VLN due to their strong generalization capabilities.
arXiv Detail & Related papers (2024-05-17T08:33:27Z)
- GOAT-Bench: A Benchmark for Multi-Modal Lifelong Navigation [65.71524410114797]
GOAT-Bench is a benchmark for the universal navigation task GO to AnyThing (GOAT).
In GOAT, the agent is directed to navigate to a sequence of targets specified by the category name, language description, or image.
We benchmark monolithic RL and modular methods on the GOAT task, analyzing their performance across modalities.
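To make the GOAT goal modalities concrete, here is a minimal, hypothetical goal-specification structure; the field names and episode layout are assumptions for illustration, not the GOAT-Bench data format.

```python
# Hypothetical sketch of a GOAT-style multi-modal goal specification
# (category name, language description, or image) -- illustrative only.
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class GoalSpec:
    modality: str                       # "category" | "language" | "image"
    category: Optional[str] = None      # e.g. "chair"
    description: Optional[str] = None   # e.g. "the blue armchair next to the lamp"
    image: Optional[np.ndarray] = None  # an RGB crop of the target instance

# One lifelong episode is a sequence of goals the agent must reach in order.
episode = [
    GoalSpec(modality="category", category="bed"),
    GoalSpec(modality="language", description="the plant on the kitchen counter"),
    # GoalSpec(modality="image", image=...),  # image goals omitted in this sketch
]
```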
arXiv Detail & Related papers (2024-04-09T20:40:00Z)
- IVLMap: Instance-Aware Visual Language Grounding for Consumer Robot Navigation [10.006058028927907]
Vision-and-Language Navigation (VLN) is a challenging task that requires a robot to navigate in photo-realistic environments following natural language prompts from humans.
Recent studies aim to handle this task by constructing the semantic spatial map representation of the environment.
We propose a new method, namely, Instance-aware Visual Language Map (IVLMap), to empower the robot with instance-level and attribute-level semantic mapping.
arXiv Detail & Related papers (2024-03-28T11:52:42Z)
- LangNav: Language as a Perceptual Representation for Navigation [63.90602960822604]
We explore the use of language as a perceptual representation for vision-and-language navigation (VLN).
Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent's egocentric panoramic view at each time step into natural language descriptions.
arXiv Detail & Related papers (2023-10-11T20:52:30Z)
- SayNav: Grounding Large Language Models for Dynamic Planning to Navigation in New Environments [14.179677726976056]
SayNav is a new approach that leverages human knowledge from Large Language Models (LLMs) for efficient generalization to complex navigation tasks.
SayNav achieves state-of-the-art results and even outperforms an oracle based baseline with strong ground-truth assumptions by more than 8% in terms of success rate.
arXiv Detail & Related papers (2023-09-08T02:24:37Z)
- Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation [87.52136927091712]
We address a practical yet challenging problem of training robot agents to navigate in an environment following a path described by some language instructions.
To achieve accurate and efficient navigation, it is critical to build a map that accurately represents both spatial location and the semantic information of the environment objects.
We propose a multi-granularity map, which contains both object fine-grained details (e.g., color, texture) and semantic classes, to represent objects more comprehensively.
arXiv Detail & Related papers (2022-10-14T04:23:27Z)
- FILM: Following Instructions in Language with Modular Methods [109.73082108379936]
Recent methods for embodied instruction following are typically trained end-to-end using imitation learning.
We propose a modular method with structured representations that builds a semantic map of the scene and performs exploration with a semantic search policy.
Our findings suggest that an explicit spatial memory and a semantic search policy can provide a stronger and more general representation for state-tracking and guidance.
arXiv Detail & Related papers (2021-10-12T16:40:01Z)
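As an illustration of the modular "semantic map plus semantic search policy" pattern described in the FILM entry above, the following is a minimal hypothetical sketch: it scores map cells by target evidence and frontier status and returns the next goal cell. The map format, confidence threshold, and scoring rule are assumptions for illustration, not FILM's actual method.

```python
# Hypothetical sketch of a semantic search policy over a 2-D semantic map,
# in the spirit of the modular methods above (not FILM's actual code).
import numpy as np

def select_goal(semantic_map, target_class, explored, frontier_bonus=0.1):
    """Pick the next navigation goal cell.

    semantic_map: (H, W, C) per-cell class probabilities from the mapper.
    target_class: index of the instructed object class.
    explored:     (H, W) boolean mask of cells already observed.
    """
    target_prob = semantic_map[..., target_class]
    if target_prob.max() > 0.5:
        # Target already mapped with high confidence: go straight to it.
        return np.unravel_index(np.argmax(target_prob), target_prob.shape)
    # Otherwise explore: prefer unexplored cells adjacent to explored space
    # (frontiers), weighted by whatever target evidence exists there.
    frontier = ~explored & (
        np.roll(explored, 1, 0) | np.roll(explored, -1, 0)
        | np.roll(explored, 1, 1) | np.roll(explored, -1, 1)
    )
    score = target_prob + frontier_bonus * frontier
    score[explored] = -np.inf  # never pick an already-explored cell
    return np.unravel_index(np.argmax(score), score.shape)
```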