RANa: Retrieval-Augmented Navigation
- URL: http://arxiv.org/abs/2504.03524v1
- Date: Fri, 04 Apr 2025 15:22:02 GMT
- Title: RANa: Retrieval-Augmented Navigation
- Authors: Gianluca Monaci, Rafael S. Rezende, Romain Deffayet, Gabriela Csurka, Guillaume Bono, Hervé Déjean, Stéphane Clinchant, Christian Wolf,
- Abstract summary: We introduce a new retrieval-augmented agent, trained with RL, capable of querying a database collected from previous episodes in the same environment.<n>We show that retrieval allows zero-shot transfer across tasks and environments while significantly improving performance.
- Score: 18.651051807288013
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Methods for navigation based on large-scale learning typically treat each episode as a new problem, where the agent is spawned with a clean memory in an unknown environment. While these generalization capabilities to an unknown environment are extremely important, we claim that, in a realistic setting, an agent should have the capacity of exploiting information collected during earlier robot operations. We address this by introducing a new retrieval-augmented agent, trained with RL, capable of querying a database collected from previous episodes in the same environment and learning how to integrate this additional context information. We introduce a unique agent architecture for the general navigation task, evaluated on ObjectNav, ImageNav and Instance-ImageNav. Our retrieval and context encoding methods are data-driven and heavily employ vision foundation models (FM) for both semantic and geometric understanding. We propose new benchmarks for these settings and we show that retrieval allows zero-shot transfer across tasks and environments while significantly improving performance.
Related papers
- NavigateDiff: Visual Predictors are Zero-Shot Navigation Assistants [24.689242976554482]
Navigating unfamiliar environments presents significant challenges for household robots.<n>Existing reinforcement learning methods cannot be directly transferred to new environments.<n>We try to transfer the logical knowledge and the generalization ability of pre-trained foundation models to zero-shot navigation.
arXiv Detail & Related papers (2025-02-19T17:27:47Z) - Augmented Commonsense Knowledge for Remote Object Grounding [67.30864498454805]
We propose an augmented commonsense knowledge model (ACK) to leverage commonsense information as atemporal knowledge graph for improving agent navigation.
ACK consists of knowledge graph-aware cross-modal and concept aggregation modules to enhance visual representation and visual-textual data alignment.
We add a new pipeline for the commonsense-based decision-making process which leads to more accurate local action prediction.
arXiv Detail & Related papers (2024-06-03T12:12:33Z) - Right Place, Right Time! Dynamizing Topological Graphs for Embodied Navigation [55.581423861790945]
Embodied Navigation tasks often involve constructing topological graphs of a scene during exploration.<n>We introduce structured object transitions to dynamize static topological graphs called Object Transition Graphs (OTGs)<n>OTGs simulate portable targets following structured routes inspired by human habits.
arXiv Detail & Related papers (2024-03-14T22:33:22Z) - Interpretable Brain-Inspired Representations Improve RL Performance on
Visual Navigation Tasks [0.0]
We show how the method of slow feature analysis (SFA) overcomes both limitations by generating interpretable representations of visual data.
We employ SFA in a modern reinforcement learning context, analyse and compare representations and illustrate where hierarchical SFA can outperform other feature extractors on navigation tasks.
arXiv Detail & Related papers (2024-02-19T11:35:01Z) - KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation [61.08389704326803]
Vision-and-language navigation (VLN) is the task to enable an embodied agent to navigate to a remote location following the natural language instruction in real scenes.
Most of the previous approaches utilize the entire features or object-centric features to represent navigable candidates.
We propose a Knowledge Enhanced Reasoning Model (KERM) to leverage knowledge to improve agent navigation ability.
arXiv Detail & Related papers (2023-03-28T08:00:46Z) - Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language
Navigation [87.52136927091712]
We address a practical yet challenging problem of training robot agents to navigate in an environment following a path described by some language instructions.
To achieve accurate and efficient navigation, it is critical to build a map that accurately represents both spatial location and the semantic information of the environment objects.
We propose a multi-granularity map, which contains both object fine-grained details (e.g., color, texture) and semantic classes, to represent objects more comprehensively.
arXiv Detail & Related papers (2022-10-14T04:23:27Z) - Visual-Language Navigation Pretraining via Prompt-based Environmental
Self-exploration [83.96729205383501]
We introduce prompt-based learning to achieve fast adaptation for language embeddings.
Our model can adapt to diverse vision-language navigation tasks, including VLN and REVERIE.
arXiv Detail & Related papers (2022-03-08T11:01:24Z) - Explore before Moving: A Feasible Path Estimation and Memory Recalling
Framework for Embodied Navigation [117.26891277593205]
We focus on the navigation and solve the problem of existing navigation algorithms lacking experience and common sense.
Inspired by the human ability to think twice before moving and conceive several feasible paths to seek a goal in unfamiliar scenes, we present a route planning method named Path Estimation and Memory Recalling framework.
We show strong experimental results of PEMR on the EmbodiedQA navigation task.
arXiv Detail & Related papers (2021-10-16T13:30:55Z) - Exploiting Scene-specific Features for Object Goal Navigation [9.806910643086043]
We introduce a new reduced dataset that speeds up the training of navigation models.
Our proposed dataset permits the training of models that do not exploit online-built maps in reasonable times.
We propose the SMTSC model, an attention-based model capable of exploiting the correlation between scenes and objects contained in them.
arXiv Detail & Related papers (2020-08-21T10:16:01Z) - Take the Scenic Route: Improving Generalization in Vision-and-Language
Navigation [44.019674347733506]
We investigate the popular Room-to-Room (R2R) VLN benchmark and discover that what is important is not only the amount of data you synthesize, but also how you do it.
We find that shortest path sampling, which is used by both the R2R benchmark and existing augmentation methods, encode biases in the action space of the agent which we dub as action priors.
We then show that these action priors offer one explanation toward the poor generalization of existing works.
arXiv Detail & Related papers (2020-03-31T14:52:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.