NavigateDiff: Visual Predictors are Zero-Shot Navigation Assistants
- URL: http://arxiv.org/abs/2502.13894v1
- Date: Wed, 19 Feb 2025 17:27:47 GMT
- Title: NavigateDiff: Visual Predictors are Zero-Shot Navigation Assistants
- Authors: Yiran Qin, Ao Sun, Yuze Hong, Benyou Wang, Ruimao Zhang
- Abstract summary: Navigating unfamiliar environments presents significant challenges for household robots.
Existing reinforcement learning methods cannot be directly transferred to new environments.
We transfer the logical knowledge and generalization ability of pre-trained foundation models to zero-shot navigation.
- Score: 24.689242976554482
- Abstract: Navigating unfamiliar environments presents significant challenges for household robots, requiring the ability to recognize and reason about novel decorations and layouts. Existing reinforcement learning methods cannot be directly transferred to new environments, as they typically rely on extensive mapping and exploration, which is time-consuming and inefficient. To address these challenges, we transfer the logical knowledge and generalization ability of pre-trained foundation models to zero-shot navigation. By integrating a large vision-language model with a diffusion network, our approach, named NavigateDiff, constructs a visual predictor that continuously predicts the agent's potential observations at the next step, helping the robot generate robust actions. Furthermore, to accommodate the temporal nature of navigation, we introduce temporal historical information to ensure that the predicted image is aligned with the navigation scene. We then carefully design an information fusion framework that embeds the predicted future frames as guidance into a goal-reaching policy to solve downstream image-navigation tasks. This approach enhances navigation control and generalization across both simulated and real-world environments. Through extensive experimentation, we demonstrate the robustness and versatility of our method, showcasing its potential to improve the efficiency and effectiveness of robotic navigation in diverse settings.
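The pipeline the abstract describes can be sketched as a predict-then-act loop. Everything below is illustrative only: `predict_next_frame` and `goal_reaching_policy` are toy stand-ins for the paper's diffusion-based predictor and learned policy, operating on flat feature vectors instead of images, and the environment update is a simple drift rule.

```python
# Toy sketch of the NavigateDiff control loop (all components are stand-ins).

def predict_next_frame(history, goal_image):
    """Stub visual predictor: fuses recent observations with the goal.

    In the paper this would be a diffusion model conditioned on the
    vision-language model and on temporal history; here we just average.
    """
    frames = history + [goal_image]
    n = len(frames)
    return [sum(px) / n for px in zip(*frames)]

def goal_reaching_policy(observation, predicted_frame):
    """Stub policy: acts on the gap between the current observation and
    the predicted next frame (the 'guidance' signal from the predictor)."""
    gap = sum(p - o for o, p in zip(observation, predicted_frame))
    if abs(gap) < 0.05:
        return "stop"
    return "forward" if gap > 0 else "turn"

def navigate(start_obs, goal_image, steps=5):
    history = [start_obs]
    actions = []
    obs = start_obs
    for _ in range(steps):
        # Temporal history window keeps predictions aligned with the scene.
        pred = predict_next_frame(history[-3:], goal_image)
        action = goal_reaching_policy(obs, pred)
        actions.append(action)
        if action == "stop":
            break
        # Toy environment update: the observation drifts toward the prediction.
        obs = [o + 0.5 * (p - o) for o, p in zip(obs, pred)]
        history.append(obs)
    return actions

print(navigate([0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]))
```

The point of the sketch is the data flow: the predictor sees temporal history plus the goal, and the policy consumes the predicted future frame as guidance rather than the goal image directly.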
Related papers
- Navigation World Models [68.58459393846461]
We introduce a controllable video generation model that predicts future visual observations based on past observations and navigation actions.
In familiar environments, NWM can plan navigation trajectories by simulating them and evaluating whether they achieve the desired goal.
Experiments demonstrate its effectiveness in planning trajectories from scratch or by ranking trajectories sampled from an external policy.
arXiv Detail & Related papers (2024-12-04T18:59:45Z)
- IN-Sight: Interactive Navigation through Sight [20.184155117341497]
IN-Sight is a novel approach to self-supervised path planning.
It calculates traversability scores and incorporates them into a semantic map.
To precisely navigate around obstacles, IN-Sight employs a local planner.
arXiv Detail & Related papers (2024-08-01T07:27:54Z)
- A Role of Environmental Complexity on Representation Learning in Deep Reinforcement Learning Agents [3.7314353481448337]
We developed a simulated navigation environment to train deep reinforcement learning agents.
We modulated the frequency of exposure to a shortcut and navigation cue, leading to the development of artificial agents with differing abilities.
We examined the encoded representations in artificial neural networks driving these agents, revealing intricate dynamics in representation learning.
arXiv Detail & Related papers (2024-07-03T18:27:26Z)
- Interactive Semantic Map Representation for Skill-based Visual Object Navigation [43.71312386938849]
This paper introduces a new representation of a scene semantic map formed during the embodied agent interaction with the indoor environment.
We have implemented this representation into a full-fledged navigation approach called SkillTron.
The proposed approach makes it possible to form both intermediate goals for robot exploration and the final goal for object navigation.
arXiv Detail & Related papers (2023-11-07T16:30:12Z)
- Learning Navigational Visual Representations with Semantic Map Supervision [85.91625020847358]
We propose a navigational-specific visual representation learning method by contrasting the agent's egocentric views and semantic maps.
Ego$2$-Map learning transfers the compact and rich information from a map, such as objects, structure and transition, to the agent's egocentric representations for navigation.
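Contrastive objectives of this kind can be illustrated with a generic InfoNCE-style loss. The sketch below is not Ego$2$-Map's actual formulation; the embeddings and temperature are hypothetical, and it only shows the idea of matched (egocentric view, semantic map) pairs scoring higher than mismatched ones.

```python
import math

# Generic InfoNCE-style contrastive loss over paired embeddings (illustrative).

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(view_embs, map_embs, temperature=0.1):
    """Average -log softmax score of each view against its paired map,
    contrasted against all other maps in the batch."""
    loss = 0.0
    for i, v in enumerate(view_embs):
        logits = [dot(v, m) / temperature for m in map_embs]
        mx = max(logits)
        log_z = mx + math.log(sum(math.exp(l - mx) for l in logits))
        loss += log_z - logits[i]  # -log p(correct map | view)
    return loss / len(view_embs)

# Well-aligned pairs should yield a lower loss than shuffled pairs.
views = [[1.0, 0.0], [0.0, 1.0]]
print(info_nce(views, [[1.0, 0.0], [0.0, 1.0]]))  # aligned: near zero
print(info_nce(views, [[0.0, 1.0], [1.0, 0.0]]))  # shuffled: large
```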
arXiv Detail & Related papers (2023-07-23T14:01:05Z)
- Visual-Language Navigation Pretraining via Prompt-based Environmental Self-exploration [83.96729205383501]
We introduce prompt-based learning to achieve fast adaptation for language embeddings.
Our model can adapt to diverse vision-language navigation tasks, including VLN and REVERIE.
arXiv Detail & Related papers (2022-03-08T11:01:24Z)
- Lifelong Topological Visual Navigation [16.41858724205884]
We propose a learning-based visual navigation method with graph update strategies that improve lifelong navigation performance over time.
We take inspiration from sampling-based planning algorithms to build image-based topological graphs, resulting in sparser graphs yet with higher navigation performance compared to baseline methods.
Unlike controllers that learn from fixed training environments, we show that our model can be finetuned using a relatively small dataset from the real-world environment where the robot is deployed.
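The sparsification idea borrowed from sampling-based planning can be illustrated with a simple gating rule: a new observation becomes a graph node only when no existing node is already close enough to cover it. Here `dist` is a Euclidean stand-in for the learned image distance, and both thresholds are hypothetical.

```python
# Illustrative sparse topological graph building (not the paper's algorithm).

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def build_sparse_graph(features, add_threshold=1.0, edge_threshold=2.0):
    """Add a node only if it is farther than add_threshold from every
    existing node; connect it to all nodes within edge_threshold."""
    nodes, edges = [], []
    for f in features:
        if all(dist(f, n) > add_threshold for n in nodes):
            i = len(nodes)
            edges += [(i, j) for j, n in enumerate(nodes)
                      if dist(f, n) <= edge_threshold]
            nodes.append(f)
    return nodes, edges

# Densely sampled observations along a corridor collapse to a sparse graph.
samples = [(0.1 * t, 0.0) for t in range(50)]
nodes, edges = build_sparse_graph(samples)
print(len(samples), len(nodes))
```

The gate is what keeps the graph sparse over a lifetime of deployment: revisiting a known place does not add redundant nodes.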
arXiv Detail & Related papers (2021-10-16T06:16:14Z)
- Deep Learning for Embodied Vision Navigation: A Survey [108.13766213265069]
The "embodied visual navigation" problem requires an agent to navigate a 3D environment relying mainly on its first-person observations.
This paper attempts to establish an outline of the current works in the field of embodied visual navigation by providing a comprehensive literature survey.
arXiv Detail & Related papers (2021-07-07T12:09:04Z)
- Rapid Exploration for Open-World Navigation with Latent Goal Models [78.45339342966196]
We describe a robotic learning system for autonomous exploration and navigation in diverse, open-world environments.
At the core of our method is a learned latent variable model of distances and actions, along with a non-parametric topological memory of images.
We use an information bottleneck to regularize the learned policy, giving us (i) a compact visual representation of goals, (ii) improved generalization capabilities, and (iii) a mechanism for sampling feasible goals for exploration.
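The information bottleneck described above can be sketched as a standard beta-weighted Gaussian KL regularizer on the latent goal code: the encoder outputs a diagonal Gaussian q(z|goal) that is penalized toward a standard-normal prior. This is a generic variational bottleneck, not the paper's exact objective, and `beta` is a hypothetical weight.

```python
import math

# Generic variational information-bottleneck term (illustrative sketch).

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, exp(log_var)) || N(0, 1) ), summed over latent dimensions."""
    return sum(0.5 * (math.exp(lv) + m * m - 1.0 - lv)
               for m, lv in zip(mu, log_var))

def bottleneck_loss(policy_nll, mu, log_var, beta=0.01):
    """Total objective: action-likelihood term plus beta-weighted KL."""
    return policy_nll + beta * kl_to_standard_normal(mu, log_var)

# A latent that matches the prior pays no KL cost...
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
# ...while a confident, shifted latent is penalized, which is what pushes
# the goal representation to stay compact.
print(kl_to_standard_normal([2.0, -1.0], [-2.0, -2.0]))
```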
arXiv Detail & Related papers (2021-04-12T23:14:41Z)
- ViNG: Learning Open-World Navigation with Visual Goals [82.84193221280216]
We propose a learning-based navigation system for reaching visually indicated goals.
We show that our system, which we call ViNG, outperforms previously-proposed methods for goal-conditioned reinforcement learning.
We demonstrate ViNG on a number of real-world applications, such as last-mile delivery and warehouse inspection.
arXiv Detail & Related papers (2020-12-17T18:22:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.