Visual Perception Generalization for Vision-and-Language Navigation via
Meta-Learning
- URL: http://arxiv.org/abs/2012.05446v3
- Date: Tue, 19 Jan 2021 02:39:00 GMT
- Title: Visual Perception Generalization for Vision-and-Language Navigation via
Meta-Learning
- Authors: Ting Wang, Zongkai Wu, Donglin Wang
- Abstract summary: Vision-and-language navigation (VLN) is a challenging task that requires an agent to navigate in real-world environments by understanding natural language instructions and visual information received in real-time.
We propose a visual perception generalization strategy based on meta-learning, which enables the agent to quickly adapt to a new camera configuration with only a few shots.
- Score: 9.519596058757033
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-and-language navigation (VLN) is a challenging task that requires an
agent to navigate in real-world environments by understanding natural language
instructions and visual information received in real-time. Prior works have
implemented VLN tasks on continuous environments or physical robots, all of
which use a fixed camera configuration due to the limitations of datasets, such
as 1.5 meters height, 90 degrees horizontal field of view (HFOV), etc. However,
real-life robots with different purposes have multiple camera configurations,
and the huge gap in visual information makes it difficult to directly transfer
the learned navigation model between various robots. In this paper, we propose
a visual perception generalization strategy based on meta-learning, which
enables the agent to quickly adapt to a new camera configuration with only a few
shots. In the training phase, we first isolate the generalization problem in the
visual perception module, and then compare two meta-learning algorithms for better
generalization in seen and unseen environments. One of them uses the
Model-Agnostic Meta-Learning (MAML) algorithm, which requires only a few-shot
adaptation, and the other is a metric-based meta-learning method with a
feature-wise affine transformation layer. The experimental results show that our
strategy successfully adapts the learned navigation model to a new camera
configuration, and the two algorithms show their advantages in seen and unseen
environments respectively.
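The abstract names two adaptation mechanisms but gives no code, so the PyTorch sketch below is a minimal, hypothetical illustration rather than the authors' implementation. It assumes the visual perception module is a feature encoder whose output the navigation policy consumes, that a few observations from the new camera can be paired with reference features from the original camera, and it invents the names `FeatureWiseAffine` and `maml_inner_adapt` as well as the MSE feature-regression objective.

```python
# Hypothetical sketch of the two adaptation routes described in the abstract.
# All class/function names and the training objective are illustrative.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureWiseAffine(nn.Module):
    """FiLM-style feature-wise affine transformation: y = gamma * x + beta.

    In the metric-based variant, a light-weight layer like this modulates the
    visual features so a frozen navigation policy can cope with a new camera.
    """

    def __init__(self, num_features: int):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(num_features))
        self.beta = nn.Parameter(torch.zeros(num_features))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_features) visual features from the encoder
        return self.gamma * feats + self.beta


def maml_inner_adapt(encoder: nn.Module,
                     support_obs: torch.Tensor,
                     reference_feats: torch.Tensor,
                     inner_lr: float = 1e-3,
                     steps: int = 3) -> nn.Module:
    """MAML-style inner loop: clone the visual encoder and take a few gradient
    steps on support shots from the new camera, regressing its output onto the
    features produced for the original camera configuration (assumed paired).
    """
    adapted = copy.deepcopy(encoder)
    params = [p for p in adapted.parameters() if p.requires_grad]
    for _ in range(steps):
        loss = F.mse_loss(adapted(support_obs), reference_feats)
        grads = torch.autograd.grad(loss, params)
        with torch.no_grad():
            for p, g in zip(params, grads):
                p -= inner_lr * g  # plain SGD step on the cloned encoder
    return adapted


# Example usage (shapes are illustrative):
# encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))
# film = FeatureWiseAffine(128)
# support_obs = torch.randn(3, 3, 64, 64)       # 3 shots from the new camera
# reference_feats = torch.randn(3, 128)         # features from the old camera
# adapted_encoder = maml_inner_adapt(encoder, support_obs, reference_feats)
# policy_input = film(adapted_encoder(support_obs))
```

In this sketch the navigation policy itself stays frozen: the MAML variant rewrites the encoder's weights with a handful of gradient steps, while the metric-based variant only modulates the encoder's output through the learned affine parameters.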
Related papers
- Aligning Knowledge Graph with Visual Perception for Object-goal Navigation [16.32780793344835]
We propose the Aligning Knowledge Graph with Visual Perception (AKGVP) method for object-goal navigation.
Our approach introduces continuous modeling of the hierarchical scene architecture and leverages visual-language pre-training to align natural language description with visual perception.
The integration of a continuous knowledge graph architecture and multimodal feature alignment empowers the navigator with a remarkable zero-shot navigation capability.
arXiv Detail & Related papers (2024-02-29T06:31:18Z)
- LaViP: Language-Grounded Visual Prompts [27.57227844809257]
We introduce a language-grounded visual prompting method to adapt the visual encoder of vision-language models for downstream tasks.
By capitalizing on language integration, we devise a parameter-efficient strategy to adjust the input of the visual encoder.
Our algorithm can operate even in black-box scenarios, showcasing adaptability in situations where access to the model's parameters is constrained.
arXiv Detail & Related papers (2023-12-18T05:50:10Z)
- LangNav: Language as a Perceptual Representation for Navigation [63.90602960822604]
We explore the use of language as a perceptual representation for vision-and-language navigation (VLN).
Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent's egocentric panoramic view at each time step into natural language descriptions.
arXiv Detail & Related papers (2023-10-11T20:52:30Z) - VELMA: Verbalization Embodiment of LLM Agents for Vision and Language
Navigation in Street View [81.58612867186633]
Vision and Language Navigation(VLN) requires visual and natural language understanding as well as spatial and temporal reasoning capabilities.
We show that VELMA is able to successfully follow navigation instructions in Street View with only two in-context examples.
We further finetune the LLM agent on a few thousand examples and achieve 25%-30% relative improvement in task completion over the previous state-of-the-art for two datasets.
arXiv Detail & Related papers (2023-07-12T11:08:24Z)
- Zero Experience Required: Plug & Play Modular Transfer Learning for Semantic Visual Navigation [97.17517060585875]
We present a unified approach to visual navigation using a novel modular transfer learning model.
Our model can effectively leverage its experience from one source task and apply it to multiple target tasks.
Our approach learns faster, generalizes better, and outperforms SoTA models by a significant margin.
arXiv Detail & Related papers (2022-02-05T00:07:21Z)
- Semantic Tracklets: An Object-Centric Representation for Visual Multi-Agent Reinforcement Learning [126.57680291438128]
We study whether scalability can be achieved via a disentangled representation.
We evaluate semantic tracklets on the visual multi-agent particle environment (VMPE) and on the challenging visual multi-agent GFootball environment.
Notably, this method is the first to successfully learn a strategy for five players in the GFootball environment using only visual data.
arXiv Detail & Related papers (2021-08-06T22:19:09Z)
- ViNG: Learning Open-World Navigation with Visual Goals [82.84193221280216]
We propose a learning-based navigation system for reaching visually indicated goals.
We show that our system, which we call ViNG, outperforms previously-proposed methods for goal-conditioned reinforcement learning.
We demonstrate ViNG on a number of real-world applications, such as last-mile delivery and warehouse inspection.
arXiv Detail & Related papers (2020-12-17T18:22:32Z)
- A Few Shot Adaptation of Visual Navigation Skills to New Observations using Meta-Learning [12.771506155747893]
We introduce a learning algorithm that enables rapid adaptation to new sensor configurations or target objects with a few shots.
Our experiments show that our algorithm adapts the learned navigation policy with only three shots for unseen situations.
arXiv Detail & Related papers (2020-11-06T21:53:52Z)
- MELD: Meta-Reinforcement Learning from Images via Latent State Models [109.1664295663325]
We develop an algorithm for meta-RL from images that performs inference in a latent state model to quickly acquire new skills.
MELD is the first meta-RL algorithm trained in a real-world robotic control setting from images.
arXiv Detail & Related papers (2020-10-26T23:50:30Z)
- Multimodal Aggregation Approach for Memory Vision-Voice Indoor Navigation with Meta-Learning [5.448283690603358]
We present a novel indoor navigation model called Memory Vision-Voice Indoor Navigation (MVV-IN).
MVV-IN receives voice commands and analyzes multimodal information of visual observation in order to enhance robots' environment understanding.
arXiv Detail & Related papers (2020-09-01T13:12:27Z)