PIG-Nav: Key Insights for Pretrained Image Goal Navigation Models
- URL: http://arxiv.org/abs/2507.17220v1
- Date: Wed, 23 Jul 2025 05:34:20 GMT
- Title: PIG-Nav: Key Insights for Pretrained Image Goal Navigation Models
- Authors: Jiansong Wan, Chengming Zhou, Jinkua Liu, Xiangge Huang, Xiaoyu Chen, Xiaohan Yi, Qisen Yang, Baiting Zhu, Xin-Qiang Cai, Lixing Liu, Rushuai Yang, Chuheng Zhang, Sherif Abdelfattah, Hayong Shin, Pushi Zhang, Li Zhao, Jiang Bian
- Abstract summary: PIG-Nav (Pretrained Image-Goal Navigation) is a new approach that further investigates pretraining strategies for vision-based navigation models. We identify two critical design choices that consistently improve the performance of pretrained navigation models. Our model achieves an average improvement of 22.6% in zero-shot settings and a 37.5% improvement in fine-tuning settings over existing visual navigation foundation models.
- Score: 16.820485795257195
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent studies have explored pretrained (foundation) models for vision-based robotic navigation, aiming to achieve generalizable navigation and positive transfer across diverse environments while enhancing zero-shot performance in unseen settings. In this work, we introduce PIG-Nav (Pretrained Image-Goal Navigation), a new approach that further investigates pretraining strategies for vision-based navigation models and contributes in two key areas. Model-wise, we identify two critical design choices that consistently improve the performance of pretrained navigation models: (1) integrating an early-fusion network structure that combines visual observations and goal images via an appropriately pretrained Vision Transformer (ViT) image encoder, and (2) introducing suitable auxiliary tasks to enhance global navigation representation learning, thus further improving navigation performance. Dataset-wise, we propose a novel data preprocessing pipeline for efficiently labeling large-scale game video datasets for navigation model training. We demonstrate that augmenting existing open navigation datasets with diverse gameplay videos improves model performance. Our model achieves an average improvement of 22.6% in zero-shot settings and a 37.5% improvement in fine-tuning settings over existing visual navigation foundation models in two complex simulated environments and one real-world environment. These results advance the state of the art in pretrained image-goal navigation models. Notably, our model maintains competitive performance while requiring significantly less fine-tuning data, highlighting its potential for real-world deployment with minimal labeled supervision.
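To make the "early-fusion plus auxiliary task" idea concrete, here is a minimal PyTorch sketch. It is only an illustration of the general design pattern described in the abstract, not the authors' actual architecture: the module names, dimensions, action parameterization, and the relative-pose auxiliary target are all assumptions. The point it demonstrates is that observation and goal-image patch tokens pass through one shared ViT-style encoder, so the two views interact at every attention layer, and an auxiliary head shapes a more global navigation representation.

```python
# Minimal sketch of an early-fusion image-goal navigation model (illustrative only;
# layer sizes, heads, and the auxiliary target are assumptions, not PIG-Nav's exact design).
import torch
import torch.nn as nn


class EarlyFusionGoalNav(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=384, depth=6, heads=6, action_dim=4):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # Shared patch embedding used for both the current observation and the goal image.
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # Token-type embeddings tell the encoder which tokens come from which image.
        self.type_embed = nn.Embedding(2, dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches, dim))
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # Main head: short-horizon actions / waypoints toward the goal.
        self.action_head = nn.Linear(dim, action_dim)
        # Auxiliary head (hypothetical example): predict the relative pose of the goal view.
        self.aux_pose_head = nn.Linear(dim, 3)

    def tokens(self, img, type_id):
        x = self.patchify(img).flatten(2).transpose(1, 2)            # (B, N, dim)
        return x + self.pos_embed + self.type_embed.weight[type_id]

    def forward(self, obs, goal):
        b = obs.size(0)
        # Early fusion: observation and goal tokens share one encoder from the first layer on.
        fused = torch.cat([self.cls_token.expand(b, -1, -1),
                           self.tokens(obs, 0),
                           self.tokens(goal, 1)], dim=1)
        h = self.encoder(fused)[:, 0]                                 # CLS summary token
        return self.action_head(h), self.aux_pose_head(h)


if __name__ == "__main__":
    model = EarlyFusionGoalNav()
    obs = torch.randn(2, 3, 224, 224)
    goal = torch.randn(2, 3, 224, 224)
    actions, rel_pose = model(obs, goal)
    print(actions.shape, rel_pose.shape)   # torch.Size([2, 4]) torch.Size([2, 3])
```

In practice the shared encoder would be initialized from a pretrained ViT checkpoint, and the auxiliary loss would be added to the imitation or goal-reaching objective during pretraining; both choices follow the abstract's description, while the specific loss weights and targets here are placeholders.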
Related papers
- From Seeing to Experiencing: Scaling Navigation Foundation Models with Reinforcement Learning [59.88543114325153]
We introduce the Seeing-to-Experiencing framework to scale the capability of navigation foundation models with reinforcement learning. S2E combines the strengths of pre-training on videos and post-training through RL. We establish a comprehensive end-to-end evaluation benchmark, NavBench-GS, built on photorealistic 3DGS reconstructions of real-world scenes.
arXiv Detail & Related papers (2025-07-29T17:26:10Z)
- NavigateDiff: Visual Predictors are Zero-Shot Navigation Assistants [24.689242976554482]
Navigating unfamiliar environments presents significant challenges for household robots. Existing reinforcement learning methods cannot be directly transferred to new environments. We try to transfer the logical knowledge and the generalization ability of pre-trained foundation models to zero-shot navigation.
arXiv Detail & Related papers (2025-02-19T17:27:47Z)
- T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design [79.7289790249621]
Our proposed method, T2V-Turbo-v2, introduces a significant advancement by integrating various supervision signals.
We highlight the crucial importance of tailoring datasets to specific learning objectives.
We demonstrate the potential of this approach by extracting motion guidance from the training datasets and incorporating it into the ODE solver.
arXiv Detail & Related papers (2024-10-08T04:30:06Z)
- VANP: Learning Where to See for Navigation with Self-Supervised Vision-Action Pre-Training [8.479135285935113]
Humans excel at efficiently navigating through crowds without collision by focusing on specific visual regions relevant to navigation.
Most robotic visual navigation methods rely on deep learning models pre-trained on vision tasks, which prioritize salient objects.
We propose a Self-Supervised Vision-Action Model for Visual Navigation Pre-Training (VANP).
arXiv Detail & Related papers (2024-03-12T22:33:08Z)
- NoMaD: Goal Masked Diffusion Policies for Navigation and Exploration [57.15811390835294]
This paper describes how we can train a single unified diffusion policy to handle both goal-directed navigation and goal-agnostic exploration.
We show that this unified policy results in better overall performance when navigating to visually indicated goals in novel environments.
Our experiments, conducted on a real-world mobile robot platform, show effective navigation in unseen environments in comparison with five alternative methods.
arXiv Detail & Related papers (2023-10-11T21:07:14Z)
- Scaling Data Generation in Vision-and-Language Navigation [116.95534559103788]
We propose an effective paradigm for generating large-scale data for learning.
We use 1200+ photo-realistic environments from the HM3D and Gibson datasets and synthesize 4.9 million instruction-trajectory pairs.
Thanks to our large-scale dataset, the performance of an existing agent can be pushed to a new best of 80% single-run success rate on the R2R test split (+11% absolute over the previous SoTA) using simple imitation learning.
arXiv Detail & Related papers (2023-07-28T16:03:28Z)
- Learning Navigational Visual Representations with Semantic Map Supervision [85.91625020847358]
We propose a navigational-specific visual representation learning method by contrasting the agent's egocentric views and semantic maps.
Ego$2$-Map learning transfers the compact and rich information from a map, such as objects, structure and transition, to the agent's egocentric representations for navigation.
arXiv Detail & Related papers (2023-07-23T14:01:05Z)
- ViNT: A Foundation Model for Visual Navigation [52.2571739391896]
Visual Navigation Transformer (ViNT) is a foundation model for vision-based robotic navigation.
ViNT is trained with a general goal-reaching objective that can be used with any navigation dataset.
It exhibits positive transfer, outperforming specialist models trained on singular datasets.
arXiv Detail & Related papers (2023-06-26T16:57:03Z)
- Perceptual underwater image enhancement with deep learning and physical priors [35.37760003463292]
We propose two perceptual enhancement models, each of which uses a deep enhancement model with a detection perceptor.
Due to the lack of training data, a hybrid underwater image synthesis model, which fuses physical priors and data-driven cues, is proposed to synthesize training data.
Experimental results show the superiority of our proposed method over several state-of-the-art methods on both real-world and synthetic underwater datasets.
arXiv Detail & Related papers (2020-08-21T22:11:34Z)
- Learning View and Target Invariant Visual Servoing for Navigation [9.873635079670093]
We learn viewpoint-invariant and target-invariant visual servoing for local mobile robot navigation.
We train a deep convolutional network controller to reach the desired goal.
arXiv Detail & Related papers (2020-03-04T20:36:43Z)