Multimodal Aggregation Approach for Memory Vision-Voice Indoor
Navigation with Meta-Learning
- URL: http://arxiv.org/abs/2009.00402v1
- Date: Tue, 1 Sep 2020 13:12:27 GMT
- Title: Multimodal Aggregation Approach for Memory Vision-Voice Indoor
Navigation with Meta-Learning
- Authors: Liqi Yan and Dongfang Liu and Yaoxian Song and Changbin Yu
- Abstract summary: We present a novel indoor navigation model called Memory Vision-Voice Indoor Navigation (MVV-IN).
MVV-IN receives voice commands and analyzes multimodal information from visual observations to enhance the robot's understanding of its environment.
- Score: 5.448283690603358
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision and voice are two vital modalities for an agent's interaction and
learning. In this paper, we present a novel indoor navigation model called
Memory Vision-Voice Indoor Navigation (MVV-IN), which receives voice commands
and analyzes multimodal information from visual observations to enhance the
robot's understanding of its environment. We use single RGB images captured by
a first-person monocular camera and apply a self-attention mechanism to keep
the agent focused on key areas. Memory is important for the agent both to avoid
unnecessarily repeating tasks and to adapt adequately to new scenes, so we
employ meta-learning. We experiment with various functional features extracted
from the visual observations; comparative experiments show that our method
outperforms state-of-the-art baselines.
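The abstract gives no implementation details, but the core idea of aggregating a voice command with self-attended visual features can be illustrated with a short sketch. The module below is a minimal, hypothetical reconstruction in PyTorch: the backbone choice, feature sizes, single-head attention, and concatenation-based fusion are all assumptions, not the authors' released architecture.

```python
# Minimal sketch (not the authors' code) of a multimodal aggregation step:
# a CNN feature map from a single first-person RGB frame, a voice-command
# embedding, and self-attention over spatial positions so the agent can
# weight key areas before fusing the two modalities.
# All module names, dimensions, and the fusion scheme are assumptions.

import torch
import torch.nn as nn


class MultimodalAggregator(nn.Module):
    """Fuse visual observation features with a voice-command embedding."""

    def __init__(self, visual_dim=512, voice_dim=128, hidden_dim=256):
        super().__init__()
        # Project flattened spatial visual features and the voice embedding
        # into a shared space.
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.voice_proj = nn.Linear(voice_dim, hidden_dim)
        # Single-head self-attention over the spatial positions of the
        # visual feature map ("focus on key areas").
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=1, batch_first=True)

    def forward(self, visual_feat, voice_emb):
        # visual_feat: (B, C, H, W) feature map from a CNN backbone
        # voice_emb:   (B, voice_dim) encoding of the voice command
        b, c, h, w = visual_feat.shape
        tokens = visual_feat.flatten(2).transpose(1, 2)    # (B, H*W, C)
        tokens = self.visual_proj(tokens)                  # (B, H*W, hidden)
        attended, _ = self.attn(tokens, tokens, tokens)    # self-attention
        visual_summary = attended.mean(dim=1)              # (B, hidden)
        voice = self.voice_proj(voice_emb)                 # (B, hidden)
        # Simple concatenation as the aggregated multimodal state.
        return torch.cat([visual_summary, voice], dim=-1)  # (B, 2*hidden)


if __name__ == "__main__":
    agg = MultimodalAggregator()
    rgb_features = torch.randn(1, 512, 7, 7)   # e.g. a ResNet conv5 output
    voice_command = torch.randn(1, 128)        # e.g. an utterance embedding
    state = agg(rgb_features, voice_command)
    print(state.shape)                         # torch.Size([1, 512])
```

In the paper's setting, this aggregated state would feed a navigation policy, and the meta-learning component would adapt the weights to a new scene with a handful of gradient steps; that outer adaptation loop is omitted from the sketch.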
Related papers
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z)
- VANP: Learning Where to See for Navigation with Self-Supervised Vision-Action Pre-Training [8.479135285935113]
Humans excel at efficiently navigating through crowds without collision by focusing on specific visual regions relevant to navigation.
Most robotic visual navigation methods rely on deep learning models pre-trained on vision tasks, which prioritize salient objects.
We propose a Self-Supervised Vision-Action Model for Visual Navigation Pre-Training (VANP).
arXiv Detail & Related papers (2024-03-12T22:33:08Z)
- Heuristic Vision Pre-Training with Self-Supervised and Supervised Multi-Task Learning [0.0]
We propose a novel pre-training framework that adopts both self-supervised and supervised visual pretext tasks in a multi-task manner.
Results show that our pre-trained models can deliver results on par with or better than state-of-the-art (SOTA) results on multiple visual tasks.
arXiv Detail & Related papers (2023-10-11T14:06:04Z)
- Learning Navigational Visual Representations with Semantic Map Supervision [85.91625020847358]
We propose a navigational-specific visual representation learning method by contrasting the agent's egocentric views and semantic maps.
Ego²-Map learning transfers the compact and rich information from a map, such as objects, structure and transition, to the agent's egocentric representations for navigation.
arXiv Detail & Related papers (2023-07-23T14:01:05Z)
- Towards self-attention based visual navigation in the real world [0.0]
Vision guided navigation requires processing complex visual information to inform task-orientated decisions.
Deep Reinforcement Learning agents trained in simulation often perform poorly when deployed in the real world.
This is the first demonstration of a self-attention-based agent successfully trained to navigate a 3D action space using fewer than 4,000 parameters.
arXiv Detail & Related papers (2022-09-15T04:51:42Z)
- Masked World Models for Visual Control [90.13638482124567]
We introduce a visual model-based RL framework that decouples visual representation learning and dynamics learning.
We demonstrate that our approach achieves state-of-the-art performance on a variety of visual robotic tasks.
arXiv Detail & Related papers (2022-06-28T18:42:27Z)
- Diagnosing Vision-and-Language Navigation: What Really Matters [61.72935815656582]
Vision-and-language navigation (VLN) is a multimodal task where an agent follows natural language instructions and navigates in visual environments.
Recent studies report a slow-down in performance improvements on both indoor and outdoor VLN tasks.
In this work, we conduct a series of diagnostic experiments to unveil agents' focus during navigation.
arXiv Detail & Related papers (2021-03-30T17:59:07Z)
- Visual Perception Generalization for Vision-and-Language Navigation via Meta-Learning [9.519596058757033]
Vision-and-language navigation (VLN) is a challenging task that requires an agent to navigate in real-world environments by understanding natural language instructions and visual information received in real-time.
We propose a visual perception generalization strategy based on meta-learning, which enables the agent to quickly adapt to a new camera configuration from only a few shots.
arXiv Detail & Related papers (2020-12-10T04:10:04Z)
- Learning to Set Waypoints for Audio-Visual Navigation [89.42192208471735]
In audio-visual navigation, an agent intelligently travels through a complex, unmapped 3D environment using both sights and sounds to find a sound source.
Existing models learn to act at a fixed granularity of agent motion and rely on simple recurrent aggregations of the audio observations.
We introduce a reinforcement learning approach to audio-visual navigation with two key novel elements.
arXiv Detail & Related papers (2020-08-21T18:00:33Z)
- Vision-Dialog Navigation by Exploring Cross-modal Memory [107.13970721435571]
Vision-dialog navigation is posed as a new holy-grail task in the vision-and-language field.
We propose the Cross-modal Memory Network (CMN) for remembering and understanding the rich information relevant to historical navigation actions.
Our CMN outperforms the previous state-of-the-art model by a significant margin on both seen and unseen environments.
arXiv Detail & Related papers (2020-03-15T03:08:06Z)