MLFM: Multi-Layered Feature Maps for Richer Language Understanding in Zero-Shot Semantic Navigation
- URL: http://arxiv.org/abs/2507.07299v2
- Date: Fri, 17 Oct 2025 00:58:38 GMT
- Title: MLFM: Multi-Layered Feature Maps for Richer Language Understanding in Zero-Shot Semantic Navigation
- Authors: Sonia Raychaudhuri, Enrico Cancelli, Tommaso Campari, Lamberto Ballan, Manolis Savva, Angel X. Chang
- Abstract summary: LangNav is an open-vocabulary multi-object navigation dataset with natural language goal descriptions. MLFM builds a queryable, multi-layered semantic map from pretrained vision-language features. Experiments on LangNav show that MLFM outperforms state-of-the-art zero-shot mapping-based navigation baselines.
- Score: 25.63797039823049
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent progress in large vision-language models has driven improvements in language-based semantic navigation, where an embodied agent must reach a target object described in natural language. Yet we still lack a clear, language-focused evaluation framework to test how well agents ground the words in their instructions. We address this gap by proposing LangNav, an open-vocabulary multi-object navigation dataset with natural language goal descriptions (e.g., 'go to the red short candle on the table') and corresponding fine-grained linguistic annotations (e.g., attributes: color=red, size=short; relations: support=on). These labels enable systematic evaluation of language understanding. To evaluate in this setting, we extend the multi-object navigation task to Language-guided Multi-Object Navigation (LaMoN), where the agent must find a sequence of goals specified using language. Furthermore, we propose the Multi-Layered Feature Map (MLFM), a novel method that builds a queryable, multi-layered semantic map from pretrained vision-language features and proves effective for reasoning over fine-grained attributes and spatial relations in goal descriptions. Experiments on LangNav show that MLFM outperforms state-of-the-art zero-shot mapping-based navigation baselines.
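As a concrete illustration of the annotation scheme and the queryable multi-layered map, the sketch below uses hypothetical names (GoalDescription, LayeredMap, embed_text) and is not the authors' implementation: it represents a goal with attribute/relation labels and scores the cells of a layered feature grid against the goal text, with embed_text standing in for a pretrained vision-language text encoder such as CLIP's.

```python
# Minimal sketch (hypothetical names, not the MLFM implementation) of a goal
# description with attribute/relation annotations and a layered feature map
# queried by text embeddings.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class GoalDescription:
    """e.g., 'go to the red short candle on the table'"""
    target: str                                        # "candle"
    attributes: dict = field(default_factory=dict)     # {"color": "red", "size": "short"}
    relations: dict = field(default_factory=dict)      # {"support": ("on", "table")}

def embed_text(text: str, dim: int = 512) -> np.ndarray:
    """Stand-in encoder: hash-seeded unit vector (same text -> same vector in a run)."""
    rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

class LayeredMap:
    """Top-down grid with one feature layer per object 'slab', so stacked
    objects at the same cell (a candle on a table) remain separable."""
    def __init__(self, h: int, w: int, dim: int = 512):
        self.h, self.w, self.dim = h, w, dim
        self.layers = []                                # list of (h, w, dim) arrays

    def add_layer(self, features: np.ndarray) -> None:
        assert features.shape == (self.h, self.w, self.dim)
        self.layers.append(features)

    def query(self, goal: GoalDescription):
        """Return (layer, row, col) of the cell best matching the goal text."""
        text = " ".join([*goal.attributes.values(), goal.target])
        q = embed_text(text, self.dim)
        scores = np.stack([layer @ q for layer in self.layers])   # (L, h, w)
        return np.unravel_index(int(np.argmax(scores)), scores.shape)

goal = GoalDescription("candle",
                       attributes={"color": "red", "size": "short"},
                       relations={"support": ("on", "table")})
m = LayeredMap(8, 8)
m.add_layer(0.01 * np.random.randn(8, 8, 512) + embed_text("table"))
m.add_layer(0.01 * np.random.randn(8, 8, 512) + embed_text("red short candle"))
print(m.query(goal))    # expected to land in layer 1, the candle layer
```

The relation label is carried but unused in this toy query; per the abstract, the actual method also reasons over spatial relations in the goal description when selecting a candidate location.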
Related papers
- LangMap: A Hierarchical Benchmark for Open-Vocabulary Goal Navigation [34.074871694181965]
We introduce HieraNav, a goal navigation task where agents interpret natural language instructions to reach targets at four semantic levels.
We present Language as a Map (LangMap), a benchmark built on real-world 3D indoor scans with comprehensive human-verified annotations.
LangMap achieves superior annotation quality, outperforming GOAT-Bench by 23.8% in discriminative accuracy using four times fewer words.
arXiv Detail & Related papers (2026-02-02T15:26:19Z) - ReasonNavi: Human-Inspired Global Map Reasoning for Zero-Shot Embodied Navigation [53.95797153529148]
Embodied agents often struggle with efficient navigation because they rely primarily on partial egocentric observations.
We introduce ReasonNavi, a human-inspired framework that operationalizes a reason-then-act paradigm by coupling Multimodal Large Language Models (MLLMs) with deterministic planners.
arXiv Detail & Related papers (2026-01-26T19:09:20Z) - NavComposer: Composing Language Instructions for Navigation Trajectories through Action-Scene-Object Modularization [17.525269369227786]
We propose NavComposer, a framework for automatically generating high-quality navigation instructions.
NavComposer explicitly decomposes semantic entities such as actions, scenes, and objects, and recomposes them into natural language instructions.
It operates in a data-agnostic manner, supporting adaptation to diverse navigation trajectories without domain-specific training.
The accompanying NavInstrCritic provides a holistic evaluation of instruction quality, addressing limitations of traditional metrics that rely heavily on expert annotations.
arXiv Detail & Related papers (2025-07-15T01:20:22Z) - Multimodal Spatial Language Maps for Robot Navigation and Manipulation [32.852583241593436]
Multimodal spatial language maps are a spatial map representation that fuses pretrained multimodal features with a 3D reconstruction of the environment.
We present two instances of our maps: visual-language maps (VLMaps) and their extension to audio-visual-language maps (AVLMaps).
These capabilities extend to mobile robots and tabletop manipulators, supporting navigation and interaction guided by visual, audio, and spatial cues.
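A rough sketch of the fusion idea (assumptions only; backproject and fuse_into_voxels are hypothetical helpers, not the VLMaps/AVLMaps code): per-pixel vision-language features are back-projected with depth and camera intrinsics and averaged into a voxel grid that text queries can later search.

```python
# Toy fusion of per-pixel features into a voxel map (hypothetical helpers).
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Pixel grid + depth -> 3D points in the camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)          # (h, w, 3)

def fuse_into_voxels(points, pixel_feats, voxel_size, grid, counts):
    """Average per-pixel vision-language features into the voxels they fall in."""
    idx = np.floor(points.reshape(-1, 3) / voxel_size).astype(int)
    feats = pixel_feats.reshape(-1, pixel_feats.shape[-1])
    for (i, j, k), f in zip(idx, feats):
        key = (i, j, k)
        grid[key] = grid.get(key, 0) + f
        counts[key] = counts.get(key, 0) + 1
    return {k: grid[k] / counts[k] for k in grid}

# toy frame: a 4x4 depth image with 8-dimensional "VL features" per pixel
depth = np.full((4, 4), 2.0)
feats = np.random.randn(4, 4, 8)
pts = backproject(depth, fx=50, fy=50, cx=2, cy=2)
voxels = fuse_into_voxels(pts, feats, voxel_size=0.5, grid={}, counts={})
print(len(voxels), "occupied voxels")
```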
arXiv Detail & Related papers (2025-06-07T17:02:13Z) - NavRAG: Generating User Demand Instructions for Embodied Navigation through Retrieval-Augmented LLM [55.79954652783797]
Vision-and-Language Navigation (VLN) is an essential skill for embodied agents, allowing them to navigate in 3D environments following natural language instructions.
Previous methods translate trajectory videos into step-by-step instructions to expand data, but such instructions do not match users' communication styles well.
We propose NavRAG, a retrieval-augmented generation framework that generates user demand instructions for VLN.
arXiv Detail & Related papers (2025-02-16T14:17:36Z) - TopV-Nav: Unlocking the Top-View Spatial Reasoning Potential of MLLM for Zero-shot Object Navigation [52.422619828854984]
We introduce TopV-Nav, an MLLM-based method that directly reasons on the top-view map with sufficient spatial information.
To fully unlock the MLLM's spatial reasoning potential in the top-view perspective, we propose the Adaptive Visual Prompt Generation (AVPG) method.
arXiv Detail & Related papers (2024-11-25T14:27:55Z) - MC-GPT: Empowering Vision-and-Language Navigation with Memory Map and Reasoning Chains [4.941781282578696]
In the Vision-and-Language Navigation (VLN) task, the agent is required to navigate to a destination following a natural language instruction.
While learning-based approaches have been a major solution to the task, they suffer from high training costs and lack of interpretability.
Recently, Large Language Models (LLMs) have emerged as a promising tool for VLN due to their strong generalization capabilities.
arXiv Detail & Related papers (2024-05-17T08:33:27Z) - Open-Set 3D Semantic Instance Maps for Vision Language Navigation -- O3D-SIM [6.475074453206891]
Humans excel at forming mental maps of their surroundings, equipping them to understand object relationships and navigate based on language queries.
We show that having instance-level information and a semantic understanding of the environment significantly improves performance on language-guided tasks.
We propose a representation that results in a 3D point cloud map with instance-level embeddings, which provide the semantic understanding that natural language commands can query.
arXiv Detail & Related papers (2024-04-27T14:20:46Z) - GOAT-Bench: A Benchmark for Multi-Modal Lifelong Navigation [65.71524410114797]
GOAT-Bench is a benchmark for the universal navigation task GO to AnyThing (GOAT).
In GOAT, the agent is directed to navigate to a sequence of targets specified by the category name, language description, or image.
We benchmark monolithic RL and modular methods on the GOAT task, analyzing their performance across modalities.
arXiv Detail & Related papers (2024-04-09T20:40:00Z) - IVLMap: Instance-Aware Visual Language Grounding for Consumer Robot Navigation [10.006058028927907]
Vision-and-Language Navigation (VLN) is a challenging task that requires a robot to navigate in photo-realistic environments following natural language prompts from humans.
Recent studies aim to handle this task by constructing the semantic spatial map representation of the environment.
We propose a new method, namely, Instance-aware Visual Language Map (IVLMap), to empower the robot with instance-level and attribute-level semantic mapping.
arXiv Detail & Related papers (2024-03-28T11:52:42Z) - LangNav: Language as a Perceptual Representation for Navigation [63.90602960822604]
We explore the use of language as a perceptual representation for vision-and-language navigation (VLN).
Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent's egocentric panoramic view at each time step into natural language descriptions.
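A toy example of this verbalization step (describe_view is a hypothetical helper, not the paper's pipeline), turning detector outputs into a textual scene description that a language model can consume:

```python
# Hypothetical helper: verbalize object detections relative to the agent's heading.
def describe_view(detections, heading_deg):
    """detections: list of (label, bearing_deg, distance_m) from an object detector."""
    parts = []
    for label, bearing, dist in detections:
        side = "ahead" if abs(bearing - heading_deg) < 30 else (
               "to the left" if bearing < heading_deg else "to the right")
        parts.append(f"a {label} about {dist:.0f}m {side}")
    return "You see " + ", ".join(parts) + "." if parts else "You see nothing notable."

print(describe_view([("sofa", 10, 3.2), ("doorway", 80, 5.0)], heading_deg=0))
# -> "You see a sofa about 3m ahead, a doorway about 5m to the right."
```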
arXiv Detail & Related papers (2023-10-11T20:52:30Z) - SayNav: Grounding Large Language Models for Dynamic Planning to Navigation in New Environments [14.179677726976056]
SayNav is a new approach that leverages human knowledge from Large Language Models (LLMs) for efficient generalization to complex navigation tasks.
SayNav achieves state-of-the-art results and even outperforms an oracle-based baseline with strong ground-truth assumptions by more than 8% in success rate.
arXiv Detail & Related papers (2023-09-08T02:24:37Z) - Simple Embodied Language Learning as a Byproduct of Meta-Reinforcement Learning [56.07190845063208]
We ask: can embodied reinforcement learning (RL) agents indirectly learn language from non-language tasks?
We design an office navigation environment where the agent's goal is to find a particular office, and office locations differ across buildings (i.e., tasks).
We find that RL agents are indeed able to learn language indirectly: agents trained with current meta-RL algorithms successfully generalize to reading floor plans with held-out layouts and language phrases.
arXiv Detail & Related papers (2023-06-14T09:48:48Z) - Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation [87.52136927091712]
We address the practical yet challenging problem of training robot agents to navigate in an environment by following a path described by language instructions.
To achieve accurate and efficient navigation, it is critical to build a map that accurately represents both spatial location and the semantic information of the environment objects.
We propose a multi-granularity map, which contains both object fine-grained details (e.g., color, texture) and semantic classes, to represent objects more comprehensively.
arXiv Detail & Related papers (2022-10-14T04:23:27Z) - FILM: Following Instructions in Language with Modular Methods [109.73082108379936]
Recent methods for embodied instruction following are typically trained end-to-end using imitation learning.
We propose a modular method with structured representations that builds a semantic map of the scene and performs exploration with a semantic search policy.
Our findings suggest that an explicit spatial memory and a semantic search policy can provide a stronger and more general representation for state-tracking and guidance.
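A much-simplified sketch of such a semantic search step (a heuristic stand-in with a hypothetical co-occurrence table, not the FILM policy): unexplored cells are scored by how related their observed neighbors are to the goal category, and the best cell becomes the next exploration target.

```python
# Heuristic semantic search over a top-down map (illustrative stand-in only).
import numpy as np

# hypothetical co-occurrence prior: how likely the goal is near each observed class
RELATEDNESS = {"mug": {"table": 0.9, "sofa": 0.2, "wall": 0.0}}

def choose_search_goal(semantic_map, explored, goal):
    """semantic_map[y][x] holds an observed class name or None; explored is a bool mask."""
    h, w = explored.shape
    best, best_score = None, -1.0
    for y in range(h):
        for x in range(w):
            if explored[y, x]:
                continue                      # only consider unexplored (frontier) cells
            score = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and semantic_map[ny][nx]:
                        score += RELATEDNESS.get(goal, {}).get(semantic_map[ny][nx], 0.0)
            if score > best_score:
                best, best_score = (y, x), score
    return best

sem = [[None] * 4 for _ in range(4)]
sem[1][1] = "table"
sem[2][3] = "wall"
explored = np.zeros((4, 4), dtype=bool)
explored[1, 1] = explored[2, 3] = True
print(choose_search_goal(sem, explored, "mug"))   # a cell adjacent to the table
```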
arXiv Detail & Related papers (2021-10-12T16:40:01Z) - Improving Cross-Modal Alignment in Vision Language Navigation via Syntactic Information [83.62098382773266]
Vision language navigation is the task that requires an agent to navigate through a 3D environment based on natural language instructions.
We propose a navigation agent that utilizes syntax information derived from a dependency tree to enhance alignment between the instruction and the current visual scenes.
Our agent achieves a new state of the art on the Room-Across-Room dataset, which contains instructions in three languages.
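For a concrete sense of the syntax information involved (this assumes spaCy and its small English model are available; it is not the paper's parser), a dependency parse exposes which attributes and prepositions attach to which objects in an instruction:

```python
# Dependency parse of a navigation instruction with spaCy (illustrative only).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Walk past the red sofa and stop at the door.")
for tok in doc:
    print(f"{tok.text:>6} --{tok.dep_}--> {tok.head.text}")
# e.g., "red --amod--> sofa" and "sofa --pobj--> past": the arcs tie attributes and
# prepositions to the objects they modify, which can be aligned with visual features.
```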
arXiv Detail & Related papers (2021-04-19T19:18:41Z)