MLFM: Multi-Layered Feature Maps for Richer Language Understanding in Zero-Shot Semantic Navigation
- URL: http://arxiv.org/abs/2507.07299v2
- Date: Fri, 17 Oct 2025 00:58:38 GMT
- Title: MLFM: Multi-Layered Feature Maps for Richer Language Understanding in Zero-Shot Semantic Navigation
- Authors: Sonia Raychaudhuri, Enrico Cancelli, Tommaso Campari, Lamberto Ballan, Manolis Savva, Angel X. Chang
- Abstract summary: LangNav is an open-vocabulary multi-object navigation dataset with natural language goal descriptions. MLFM builds a queryable, multi-layered semantic map from pretrained vision-language features. Experiments on LangNav show that MLFM outperforms state-of-the-art zero-shot mapping-based navigation baselines.
- Score: 25.63797039823049
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent progress in large vision-language models has driven improvements in language-based semantic navigation, where an embodied agent must reach a target object described in natural language. Yet we still lack a clear, language-focused evaluation framework to test how well agents ground the words in their instructions. We address this gap by proposing LangNav, an open-vocabulary multi-object navigation dataset with natural language goal descriptions (e.g., 'go to the red short candle on the table') and corresponding fine-grained linguistic annotations (e.g., attributes: color=red, size=short; relations: support=on). These labels enable systematic evaluation of language understanding. To evaluate in this setting, we extend the multi-object navigation task to Language-guided Multi-Object Navigation (LaMoN), where the agent must find a sequence of goals specified using language. Furthermore, we propose the Multi-Layered Feature Map (MLFM), a novel method that builds a queryable, multi-layered semantic map from pretrained vision-language features and proves effective for reasoning over fine-grained attributes and spatial relations in goal descriptions. Experiments on LangNav show that MLFM outperforms state-of-the-art zero-shot mapping-based navigation baselines.
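As a concrete illustration of the annotation scheme and the queryable multi-layered map, the sketch below uses hypothetical names (GoalDescription, LayeredMap, embed_text) and is not the authors' implementation: it represents a goal with attribute/relation labels and scores the cells of a layered feature grid against the goal text, with embed_text standing in for a pretrained vision-language text encoder such as CLIP's.

```python
# Minimal sketch (hypothetical names, not the MLFM implementation) of a goal
# description with attribute/relation annotations and a layered feature map
# queried by text embeddings.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class GoalDescription:
    """e.g., 'go to the red short candle on the table'"""
    target: str                                        # "candle"
    attributes: dict = field(default_factory=dict)     # {"color": "red", "size": "short"}
    relations: dict = field(default_factory=dict)      # {"support": ("on", "table")}

def embed_text(text: str, dim: int = 512) -> np.ndarray:
    """Stand-in encoder: hash-seeded unit vector (same text -> same vector in a run)."""
    rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

class LayeredMap:
    """Top-down grid with one feature layer per object 'slab', so stacked
    objects at the same cell (a candle on a table) remain separable."""
    def __init__(self, h: int, w: int, dim: int = 512):
        self.h, self.w, self.dim = h, w, dim
        self.layers = []                                # list of (h, w, dim) arrays

    def add_layer(self, features: np.ndarray) -> None:
        assert features.shape == (self.h, self.w, self.dim)
        self.layers.append(features)

    def query(self, goal: GoalDescription):
        """Return (layer, row, col) of the cell best matching the goal text."""
        text = " ".join([*goal.attributes.values(), goal.target])
        q = embed_text(text, self.dim)
        scores = np.stack([layer @ q for layer in self.layers])   # (L, h, w)
        return np.unravel_index(int(np.argmax(scores)), scores.shape)

goal = GoalDescription("candle",
                       attributes={"color": "red", "size": "short"},
                       relations={"support": ("on", "table")})
m = LayeredMap(8, 8)
m.add_layer(0.01 * np.random.randn(8, 8, 512) + embed_text("table"))
m.add_layer(0.01 * np.random.randn(8, 8, 512) + embed_text("red short candle"))
print(m.query(goal))    # expected to land in layer 1, the candle layer
```

The relation label is carried but unused in this toy query; per the abstract, the actual method also reasons over spatial relations in the goal description when selecting a candidate location.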
Related papers
- LangMap: A Hierarchical Benchmark for Open-Vocabulary Goal Navigation [34.074871694181965]
We introduce HieraNav, a goal navigation task where agents interpret natural language instructions to reach targets at four semantic levels.
We present Language as a Map (LangMap), a benchmark built on real-world 3D indoor scans with comprehensive human-verified annotations.
LangMap achieves superior annotation quality, outperforming GOAT-Bench by 23.8% in discriminative accuracy using four times fewer words.
arXiv Detail & Related papers (2026-02-02T15:26:19Z) - ReasonNavi: Human-Inspired Global Map Reasoning for Zero-Shot Embodied Navigation [53.95797153529148]
Embodied agents often struggle with efficient navigation because they rely primarily on partial egocentric observations.
We introduce ReasonNavi, a human-inspired framework that operationalizes a reason-then-act paradigm by coupling Multimodal Large Language Models (MLLMs) with deterministic planners.
arXiv Detail & Related papers (2026-01-26T19:09:20Z) - NavComposer: Composing Language Instructions for Navigation Trajectories through Action-Scene-Object Modularization [17.525269369227786]
We propose NavComposer, a framework for automatically generating high-quality navigation instructions.
NavComposer explicitly decomposes semantic entities such as actions, scenes, and objects, and recomposes them into natural language instructions.
It operates in a data-agnostic manner, supporting adaptation to diverse navigation trajectories without domain-specific training.
The accompanying NavInstrCritic provides a holistic evaluation of instruction quality, addressing limitations of traditional metrics that rely heavily on expert annotations.
arXiv Detail & Related papers (2025-07-15T01:20:22Z) - Multimodal Spatial Language Maps for Robot Navigation and Manipulation [32.852583241593436]
Multimodal spatial language maps are a spatial map representation that fuses pretrained multimodal features with a 3D reconstruction of the environment.
We present two instances of our maps: visual-language maps (VLMaps) and their extension to audio-visual-language maps (AVLMaps).
These capabilities extend to mobile robots and tabletop manipulators, supporting navigation and interaction guided by visual, audio, and spatial cues.
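A rough sketch of the fusion idea (assumptions only; backproject and fuse_into_voxels are hypothetical helpers, not the VLMaps/AVLMaps code): per-pixel vision-language features are back-projected with depth and camera intrinsics and averaged into a voxel grid that text queries can later search.

```python
# Toy fusion of per-pixel features into a voxel map (hypothetical helpers).
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Pixel grid + depth -> 3D points in the camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)          # (h, w, 3)

def fuse_into_voxels(points, pixel_feats, voxel_size, grid, counts):
    """Average per-pixel vision-language features into the voxels they fall in."""
    idx = np.floor(points.reshape(-1, 3) / voxel_size).astype(int)
    feats = pixel_feats.reshape(-1, pixel_feats.shape[-1])
    for (i, j, k), f in zip(idx, feats):
        key = (i, j, k)
        grid[key] = grid.get(key, 0) + f
        counts[key] = counts.get(key, 0) + 1
    return {k: grid[k] / counts[k] for k in grid}

# toy frame: a 4x4 depth image with 8-dimensional "VL features" per pixel
depth = np.full((4, 4), 2.0)
feats = np.random.randn(4, 4, 8)
pts = backproject(depth, fx=50, fy=50, cx=2, cy=2)
voxels = fuse_into_voxels(pts, feats, voxel_size=0.5, grid={}, counts={})
print(len(voxels), "occupied voxels")
```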
arXiv Detail & Related papers (2025-06-07T17:02:13Z) - NavRAG: Generating User Demand Instructions for Embodied Navigation through Retrieval-Augmented LLM [55.79954652783797]
Vision-and-Language Navigation (VLN) is an essential skill for embodied agents, allowing them to navigate in 3D environments following natural language instructions.
Previous methods translate trajectory videos into step-by-step instructions to expand data, but such instructions do not match users' communication styles well.
We propose NavRAG, a retrieval-augmented generation framework that generates user demand instructions for VLN.
arXiv Detail & Related papers (2025-02-16T14:17:36Z) - TopV-Nav: Unlocking the Top-View Spatial Reasoning Potential of MLLM for Zero-shot Object Navigation [52.422619828854984]
We introduce TopV-Nav, an MLLM-based method that directly reasons on the top-view map with sufficient spatial information.
To fully unlock the MLLM's spatial reasoning potential in the top-view perspective, we propose the Adaptive Visual Prompt Generation (AVPG) method.
arXiv Detail & Related papers (2024-11-25T14:27:55Z) - MC-GPT: Empowering Vision-and-Language Navigation with Memory Map and Reasoning Chains [4.941781282578696]
In the Vision-and-Language Navigation (VLN) task, the agent is required to navigate to a destination following a natural language instruction.
While learning-based approaches have been a major solution to the task, they suffer from high training costs and lack of interpretability.
Recently, Large Language Models (LLMs) have emerged as a promising tool for VLN due to their strong generalization capabilities.
arXiv Detail & Related papers (2024-05-17T08:33:27Z) - Open-Set 3D Semantic Instance Maps for Vision Language Navigation -- O3D-SIM [6.475074453206891]
Humans excel at forming mental maps of their surroundings, equipping them to understand object relationships and navigate based on language queries.
We show that having instance-level information and a semantic understanding of the environment significantly improves performance on language-guided tasks.
We propose a representation that results in a 3D point cloud map with instance-level embeddings, which provide the semantic understanding that natural language commands can query.
arXiv Detail & Related papers (2024-04-27T14:20:46Z) - GOAT-Bench: A Benchmark for Multi-Modal Lifelong Navigation [65.71524410114797]
GOAT-Bench is a benchmark for the universal navigation task GO to AnyThing (GOAT).
In GOAT, the agent is directed to navigate to a sequence of targets specified by the category name, language description, or image.
We benchmark monolithic RL and modular methods on the GOAT task, analyzing their performance across modalities.
arXiv Detail & Related papers (2024-04-09T20:40:00Z) - IVLMap: Instance-Aware Visual Language Grounding for Consumer Robot Navigation [10.006058028927907]
Vision-and-Language Navigation (VLN) is a challenging task that requires a robot to navigate in photo-realistic environments following natural language prompts from humans.
Recent studies aim to handle this task by constructing the semantic spatial map representation of the environment.
We propose a new method, namely, Instance-aware Visual Language Map (IVLMap), to empower the robot with instance-level and attribute-level semantic mapping.
arXiv Detail & Related papers (2024-03-28T11:52:42Z) - LangNav: Language as a Perceptual Representation for Navigation [63.90602960822604]
We explore the use of language as a perceptual representation for vision-and-language navigation (VLN).
Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent's egocentric panoramic view at each time step into natural language descriptions.
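A toy example of this verbalization step (describe_view is a hypothetical helper, not the paper's pipeline), turning detector outputs into a textual scene description that a language model can consume:

```python
# Hypothetical helper: verbalize object detections relative to the agent's heading.
def describe_view(detections, heading_deg):
    """detections: list of (label, bearing_deg, distance_m) from an object detector."""
    parts = []
    for label, bearing, dist in detections:
        side = "ahead" if abs(bearing - heading_deg) < 30 else (
               "to the left" if bearing < heading_deg else "to the right")
        parts.append(f"a {label} about {dist:.0f}m {side}")
    return "You see " + ", ".join(parts) + "." if parts else "You see nothing notable."

print(describe_view([("sofa", 10, 3.2), ("doorway", 80, 5.0)], heading_deg=0))
# -> "You see a sofa about 3m ahead, a doorway about 5m to the right."
```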
arXiv Detail & Related papers (2023-10-11T20:52:30Z) - SayNav: Grounding Large Language Models for Dynamic Planning to Navigation in New Environments [14.179677726976056]
SayNav is a new approach that leverages human knowledge from Large Language Models (LLMs) for efficient generalization to complex navigation tasks.
SayNav achieves state-of-the-art results and even outperforms an oracle-based baseline with strong ground-truth assumptions by more than 8% in success rate.
arXiv Detail & Related papers (2023-09-08T02:24:37Z) - Simple Embodied Language Learning as a Byproduct of Meta-Reinforcement Learning [56.07190845063208]
We ask: can embodied reinforcement learning (RL) agents indirectly learn language from non-language tasks?
We design an office navigation environment where the agent's goal is to find a particular office, and office locations differ across buildings (i.e., tasks).
We find that RL agents are indeed able to learn language indirectly: agents trained with current meta-RL algorithms successfully generalize to reading floor plans with held-out layouts and language phrases.
arXiv Detail & Related papers (2023-06-14T09:48:48Z) - Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation [87.52136927091712]
We address the practical yet challenging problem of training robot agents to navigate in an environment by following a path described by language instructions.
To achieve accurate and efficient navigation, it is critical to build a map that accurately represents both spatial location and the semantic information of the environment objects.
We propose a multi-granularity map, which contains both object fine-grained details (e.g., color, texture) and semantic classes, to represent objects more comprehensively.
arXiv Detail & Related papers (2022-10-14T04:23:27Z) - FILM: Following Instructions in Language with Modular Methods [109.73082108379936]
Recent methods for embodied instruction following are typically trained end-to-end using imitation learning.
We propose a modular method with structured representations that builds a semantic map of the scene and performs exploration with a semantic search policy.
Our findings suggest that an explicit spatial memory and a semantic search policy can provide a stronger and more general representation for state-tracking and guidance.
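A much-simplified sketch of such a semantic search step (a heuristic stand-in with a hypothetical co-occurrence table, not the FILM policy): unexplored cells are scored by how related their observed neighbors are to the goal category, and the best cell becomes the next exploration target.

```python
# Heuristic semantic search over a top-down map (illustrative stand-in only).
import numpy as np

# hypothetical co-occurrence prior: how likely the goal is near each observed class
RELATEDNESS = {"mug": {"table": 0.9, "sofa": 0.2, "wall": 0.0}}

def choose_search_goal(semantic_map, explored, goal):
    """semantic_map[y][x] holds an observed class name or None; explored is a bool mask."""
    h, w = explored.shape
    best, best_score = None, -1.0
    for y in range(h):
        for x in range(w):
            if explored[y, x]:
                continue                      # only consider unexplored (frontier) cells
            score = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and semantic_map[ny][nx]:
                        score += RELATEDNESS.get(goal, {}).get(semantic_map[ny][nx], 0.0)
            if score > best_score:
                best, best_score = (y, x), score
    return best

sem = [[None] * 4 for _ in range(4)]
sem[1][1] = "table"
sem[2][3] = "wall"
explored = np.zeros((4, 4), dtype=bool)
explored[1, 1] = explored[2, 3] = True
print(choose_search_goal(sem, explored, "mug"))   # a cell adjacent to the table
```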
arXiv Detail & Related papers (2021-10-12T16:40:01Z) - Improving Cross-Modal Alignment in Vision Language Navigation via Syntactic Information [83.62098382773266]
Vision language navigation is the task that requires an agent to navigate through a 3D environment based on natural language instructions.
We propose a navigation agent that utilizes syntax information derived from a dependency tree to enhance alignment between the instruction and the current visual scenes.
Our agent achieves a new state of the art on the Room-Across-Room dataset, which contains instructions in three languages.
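For a concrete sense of the syntax information involved (this assumes spaCy and its small English model are available; it is not the paper's parser), a dependency parse exposes which attributes and prepositions attach to which objects in an instruction:

```python
# Dependency parse of a navigation instruction with spaCy (illustrative only).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Walk past the red sofa and stop at the door.")
for tok in doc:
    print(f"{tok.text:>6} --{tok.dep_}--> {tok.head.text}")
# e.g., "red --amod--> sofa" and "sofa --pobj--> past": the arcs tie attributes and
# prepositions to the objects they modify, which can be aligned with visual features.
```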
arXiv Detail & Related papers (2021-04-19T19:18:41Z)