Evolving Graphical Planner: Contextual Global Planning for
Vision-and-Language Navigation
- URL: http://arxiv.org/abs/2007.05655v1
- Date: Sat, 11 Jul 2020 00:21:05 GMT
- Title: Evolving Graphical Planner: Contextual Global Planning for
Vision-and-Language Navigation
- Authors: Zhiwei Deng, Karthik Narasimhan, Olga Russakovsky
- Abstract summary: We introduce the Evolving Graphical Planner (EGP), a model that performs global planning for navigation based on raw sensory input.
We evaluate our model on a challenging Vision-and-Language Navigation (VLN) task with photorealistic images and achieve superior performance compared to previous navigation architectures.
- Score: 47.79784520827089
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The ability to perform effective planning is crucial for building an
instruction-following agent. When navigating through a new environment, an
agent is challenged with (1) connecting the natural language instructions with
its progressively growing knowledge of the world; and (2) performing long-range
planning and decision making in the form of effective exploration and error
correction. Current methods are still limited on both fronts despite extensive
efforts. In this paper, we introduce the Evolving Graphical Planner (EGP), a
model that performs global planning for navigation based on raw sensory input.
The model dynamically constructs a graphical representation, generalizes the
action space to allow for more flexible decision making, and performs efficient
planning on a proxy graph representation. We evaluate our model on a
challenging Vision-and-Language Navigation (VLN) task with photorealistic
images and achieve superior performance compared to previous navigation
architectures. For instance, we achieve a 53% success rate on the test split of
the Room-to-Room navigation task through pure imitation learning, outperforming
previous navigation architectures by up to 5%.
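The abstract describes the EGP only at the architectural level. As a rough illustration of the two ingredients it names, a navigation graph that grows from the agent's observations and a smaller proxy graph used for planning, here is a minimal Python sketch; the class and the `expand`/`build_proxy`/`plan` names, as well as the node-selection rule, are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): an "evolving" navigation graph that
# grows as new viewpoints are observed, plus a coarse proxy graph for planning.
from collections import defaultdict
import heapq


class EvolvingGraph:
    """Stores observed viewpoints and the edges observed between them."""

    def __init__(self):
        self.edges = defaultdict(dict)   # node -> {neighbor: cost}

    def expand(self, current, observed_neighbors):
        """Add edges from the current viewpoint to newly observed neighbors."""
        for neighbor, cost in observed_neighbors:
            self.edges[current][neighbor] = cost
            self.edges[neighbor][current] = cost

    def build_proxy(self, keep_nodes):
        """Collapse the graph to a subset of 'important' nodes (an assumed
        selection rule). Only direct edges between kept nodes are retained;
        a fuller version would also contract paths through dropped nodes."""
        proxy = EvolvingGraph()
        keep = set(keep_nodes)
        for u in keep:
            for v, c in self.edges[u].items():
                if v in keep:
                    proxy.edges[u][v] = c
        return proxy

    def plan(self, start, goal):
        """Dijkstra over the (proxy) graph; returns a node sequence or None."""
        queue, seen = [(0.0, start, [start])], set()
        while queue:
            dist, node, path = heapq.heappop(queue)
            if node == goal:
                return path
            if node in seen:
                continue
            seen.add(node)
            for nxt, cost in self.edges[node].items():
                if nxt not in seen:
                    heapq.heappush(queue, (dist + cost, nxt, path + [nxt]))
        return None


# Usage: grow the graph step by step, then plan toward a candidate goal node.
g = EvolvingGraph()
g.expand("v0", [("v1", 1.0), ("v2", 1.5)])
g.expand("v1", [("v3", 1.0)])
print(g.plan("v0", "v3"))  # ['v0', 'v1', 'v3']
```

In EGP the proxy graph is what keeps planning tractable as the full graph keeps growing; the sketch only mimics that by restricting search to a kept subset of nodes.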
Related papers
- Aligning Knowledge Graph with Visual Perception for Object-goal Navigation [16.32780793344835]
We propose the Aligning Knowledge Graph with Visual Perception (AKGVP) method for object-goal navigation.
Our approach introduces continuous modeling of the hierarchical scene architecture and leverages visual-language pre-training to align natural language description with visual perception.
The integration of a continuous knowledge graph architecture and multimodal feature alignment empowers the navigator with a remarkable zero-shot navigation capability.
arXiv Detail & Related papers (2024-02-29T06:31:18Z)
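The AKGVP entry above mentions multimodal feature alignment between language descriptions and visual perception but gives no details. Below is a generic, hedged sketch of one common way to do such alignment, a CLIP-style contrastive loss; the encoders, dimensions, and temperature are placeholder assumptions, not the paper's design.

```python
# Hedged illustration of language-vision feature alignment (CLIP-style
# contrastive loss); sizes and temperature are assumptions, not AKGVP's.
import torch
import torch.nn.functional as F

def alignment_loss(text_emb, vis_emb, temperature=0.07):
    """text_emb, vis_emb: (batch, dim) embeddings of paired descriptions/views."""
    text_emb = F.normalize(text_emb, dim=-1)
    vis_emb = F.normalize(vis_emb, dim=-1)
    logits = text_emb @ vis_emb.t() / temperature      # pairwise similarities
    targets = torch.arange(text_emb.size(0))           # i-th text matches i-th view
    # Symmetric cross-entropy over rows (text->vision) and columns (vision->text).
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example with random features standing in for encoder outputs.
loss = alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
```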
- Towards Learning a Generalist Model for Embodied Navigation [24.816490551945435]
We propose the first generalist model for embodied navigation, NaviLLM.
It adapts LLMs to embodied navigation by introducing schema-based instruction.
We conduct extensive experiments to evaluate the performance and generalizability of our model.
arXiv Detail & Related papers (2023-12-04T16:32:51Z)
- E(2)-Equivariant Graph Planning for Navigation [26.016209191573605]
We exploit Euclidean symmetry in planning for 2D navigation.
To address the challenges of unstructured environments, we formulate the navigation problem as planning on a geometric graph.
arXiv Detail & Related papers (2023-09-22T17:59:48Z)
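The E(2)-equivariant planner above casts navigation as planning on a geometric graph. The sketch below (using networkx for brevity) shows that formulation in its plainest form and checks the symmetry the paper exploits: applying an E(2) transform such as a rotation to the coordinates leaves edge costs, and hence the optimal plan, unchanged. Nothing here reproduces the paper's equivariant architecture.

```python
# Sketch (not the paper's model): navigation as shortest-path planning on a
# geometric graph, plus a check that the plan is invariant to E(2) transforms
# of the coordinates -- the symmetry the equivariant planner exploits.
import math
import networkx as nx

def build_graph(positions, edges):
    """positions: {node: (x, y)}; edge cost = Euclidean distance."""
    g = nx.Graph()
    for u, v in edges:
        (x1, y1), (x2, y2) = positions[u], positions[v]
        g.add_edge(u, v, weight=math.hypot(x1 - x2, y1 - y2))
    return g

positions = {0: (0, 0), 1: (1, 0), 2: (1, 1), 3: (2, 1)}
edges = [(0, 1), (1, 2), (2, 3), (0, 2)]
plan = nx.shortest_path(build_graph(positions, edges), 0, 3, weight="weight")

# Rotate every coordinate by 90 degrees (an element of E(2)); distances, and
# therefore the optimal plan, are unchanged.
rotated = {n: (-y, x) for n, (x, y) in positions.items()}
plan_rot = nx.shortest_path(build_graph(rotated, edges), 0, 3, weight="weight")
assert plan == plan_rot
```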
- CorNav: Autonomous Agent with Self-Corrected Planning for Zero-Shot Vision-and-Language Navigation [73.78984332354636]
CorNav is a novel zero-shot framework for vision-and-language navigation.
It incorporates environmental feedback for refining future plans and adjusting its actions.
It consistently outperforms all baselines in a zero-shot multi-task setting.
arXiv Detail & Related papers (2023-06-17T11:44:04Z)
- Improving Vision-and-Language Navigation by Generating Future-View Image Semantics [96.8435716885159]
Vision-and-Language Navigation (VLN) is the task that requires an agent to navigate through the environment based on natural language instructions.
We propose three proxy tasks during the agent's in-domain pre-training: Masked Panorama Modeling (MPM), Masked Trajectory Modeling (MTM), and Action Prediction with Image Generation (APIG).
We then fine-tune the agent on the VLN task with an auxiliary loss that minimizes the difference between the view semantics generated by the agent and the ground truth view semantics of the next step.
arXiv Detail & Related papers (2023-04-11T00:36:02Z)
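The entry above fine-tunes with an auxiliary loss between the agent's predicted next-view semantics and the ground truth. A hedged sketch of such a combined objective follows; the MSE distance, the 0.1 weight, and the tensor shapes are assumptions, since the paper may define the semantic discrepancy differently.

```python
# Hedged sketch of a combined VLN objective: action prediction plus an
# auxiliary term tying predicted next-view semantics to the ground truth.
# The MSE distance and the 0.1 weight are assumptions, not the paper's values.
import torch
import torch.nn.functional as F

def vln_loss(action_logits, gt_actions, pred_view_sem, gt_view_sem, aux_weight=0.1):
    """action_logits: (batch, num_actions); gt_actions: (batch,) action indices;
    pred_view_sem / gt_view_sem: (batch, sem_dim) semantics of the next view."""
    action_loss = F.cross_entropy(action_logits, gt_actions)
    aux_loss = F.mse_loss(pred_view_sem, gt_view_sem)
    return action_loss + aux_weight * aux_loss

# Example with random tensors standing in for model outputs.
loss = vln_loss(torch.randn(4, 6), torch.randint(0, 6, (4,)),
                torch.randn(4, 128), torch.randn(4, 128))
```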
- Target-Driven Structured Transformer Planner for Vision-Language Navigation [55.81329263674141]
We propose a Target-Driven Structured Transformer Planner (TD-STP) for long-horizon goal-guided and room layout-aware navigation.
Specifically, we devise an Imaginary Scene Tokenization mechanism for explicit estimation of the long-term target.
In addition, we design a Structured Transformer Planner which elegantly incorporates the explored room layout into a neural attention architecture for structured and global planning.
arXiv Detail & Related papers (2022-07-19T06:46:21Z)
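TD-STP is summarized above as folding the explored room layout into a neural attention architecture. One generic way to realize that pattern is to embed layout nodes as extra tokens and let a standard Transformer attend over instruction, observation, and layout tokens jointly, as in the sketch below; this illustrates the general pattern only and is not the paper's planner.

```python
# Generic illustration (not TD-STP itself): treat explored room-layout nodes as
# extra tokens and let a standard Transformer attend over instruction,
# observation, and layout tokens jointly.
import torch
import torch.nn as nn

d_model = 256
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=2,
)

instr_tokens = torch.randn(1, 20, d_model)   # encoded instruction words
obs_tokens = torch.randn(1, 36, d_model)     # panoramic view features
layout_tokens = torch.randn(1, 10, d_model)  # embeddings of explored rooms/nodes

# Structured input: the planner can attend to the room layout when scoring actions.
fused = encoder(torch.cat([instr_tokens, obs_tokens, layout_tokens], dim=1))
print(fused.shape)  # torch.Size([1, 66, 256]); fed to downstream action scoring
```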
- Waypoint Models for Instruction-guided Navigation in Continuous Environments [68.2912740006109]
We develop a class of language-conditioned waypoint prediction networks to examine this question.
We measure task performance and estimated execution time on a profiled LoCoBot robot.
Our models outperform prior work in VLN-CE and set a new state-of-the-art on the public leaderboard.
arXiv Detail & Related papers (2021-10-05T17:55:49Z)
- Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training [150.35927365127176]
We present the first pre-training and fine-tuning paradigm for vision-and-language navigation (VLN) tasks.
By training on a large amount of image-text-action triplets in a self-supervised learning manner, the pre-trained model provides generic representations of visual environments and language instructions.
It learns more effectively in new tasks and generalizes better in a previously unseen environment.
arXiv Detail & Related papers (2020-02-25T03:08:12Z)
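The pre-training entry above trains on image-text-action triplets in a self-supervised manner. As a loose sketch of what a triplet batch and one simple joint objective could look like (the dataset fields, encoder, and objective here are assumptions and differ from the paper's actual setup), consider:

```python
# Hedged sketch of pre-training on image-text-action triplets: a toy dataset
# format and a joint encoder scoring a small discrete action space.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset

class TripletDataset(Dataset):
    """Each item pairs a visual feature, an instruction encoding, and the action taken."""
    def __init__(self, triplets):
        self.triplets = triplets  # list of (image_feat, text_feat, action_id)

    def __len__(self):
        return len(self.triplets)

    def __getitem__(self, idx):
        return self.triplets[idx]

class JointEncoder(nn.Module):
    """Fuses image and text features and scores candidate actions."""
    def __init__(self, dim=512, num_actions=6):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                  nn.Linear(dim, num_actions))

    def forward(self, image_feat, text_feat):
        return self.fuse(torch.cat([image_feat, text_feat], dim=-1))

# One pre-training step on synthetic triplets standing in for real data.
data = [(torch.randn(512), torch.randn(512), torch.tensor(2)) for _ in range(8)]
loader = DataLoader(TripletDataset(data), batch_size=4)
model, loss_fn = JointEncoder(), nn.CrossEntropyLoss()
for image_feat, text_feat, action_id in loader:
    loss = loss_fn(model(image_feat, text_feat), action_id)
```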
This list is automatically generated from the titles and abstracts of the papers on this site.