Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language
Navigation
- URL: http://arxiv.org/abs/2210.07506v1
- Date: Fri, 14 Oct 2022 04:23:27 GMT
- Title: Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language
Navigation
- Authors: Peihao Chen, Dongyu Ji, Kunyang Lin, Runhao Zeng, Thomas H. Li,
Mingkui Tan, Chuang Gan
- Abstract summary: We address a practical yet challenging problem of training robot agents to navigate in an environment following a path described by some language instructions.
To achieve accurate and efficient navigation, it is critical to build a map that accurately represents both spatial location and the semantic information of the environment objects.
We propose a multi-granularity map, which contains both fine-grained object details (e.g., color, texture) and semantic classes, to represent objects more comprehensively.
- Score: 87.52136927091712
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We address a practical yet challenging problem of training robot agents to
navigate in an environment following a path described by some language
instructions. The instructions often contain descriptions of objects in the
environment. To achieve accurate and efficient navigation, it is critical to
build a map that accurately represents both spatial location and the semantic
information of the environment objects. However, enabling a robot to build a
map that well represents the environment is extremely challenging as the
environment often involves diverse objects with various attributes. In this
paper, we propose a multi-granularity map, which contains both fine-grained
object details (e.g., color, texture) and semantic classes, to represent
objects more comprehensively. Moreover, we propose a weakly-supervised
auxiliary task, which requires the agent to localize instruction-relevant
objects on the map. Through this task, the agent not only learns to localize
the instruction-relevant objects for navigation but also is encouraged to learn
a better map representation that reveals object information. We then feed the
learned map and instruction to a waypoint predictor to determine the next
navigation goal. Experimental results show our method outperforms the
state of the art by 4.0% and 4.6% in success rate in seen and unseen
environments, respectively, on the VLN-CE dataset. Code is available at
https://github.com/PeihaoChen/WS-MGMap.
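
As a rough illustration of how such a pipeline could be wired together, the PyTorch sketch below fuses a fine-grained feature map with a semantic-class map, adds a weakly-supervised head that scores each map cell for instruction-relevant objects, and predicts a waypoint from the fused map and instruction. This is not the released WS-MGMap code; the module names, channel sizes, and single-layer heads are simplifying assumptions.

```python
# Minimal sketch (not the released WS-MGMap code): wiring a multi-granularity
# map, a weakly-supervised object-localization head, and a waypoint predictor.
# Module names, channel sizes, and the single-layer heads are assumptions.
import torch
import torch.nn as nn

class MultiGranularityMapAgent(nn.Module):
    def __init__(self, fine_dim=32, num_classes=40, instr_dim=256):
        super().__init__()
        # Fuse fine-grained feature channels with semantic-class channels.
        self.map_encoder = nn.Sequential(
            nn.Conv2d(fine_dim + num_classes, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
        )
        # Auxiliary head: per-cell logit for "instruction-relevant object here".
        self.localization_head = nn.Conv2d(128 + instr_dim, 1, 1)
        # Waypoint predictor: fused map + instruction -> (x, y) goal on the map.
        self.waypoint_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128 + instr_dim, 2),
        )

    def forward(self, fine_map, sem_map, instr_feat):
        # fine_map: (B, fine_dim, H, W)    fine-grained feature map
        # sem_map:  (B, num_classes, H, W) semantic-class map
        # instr_feat: (B, instr_dim)       pooled instruction embedding
        m = self.map_encoder(torch.cat([fine_map, sem_map], dim=1))
        instr = instr_feat[:, :, None, None].expand(-1, -1, m.size(2), m.size(3))
        fused = torch.cat([m, instr], dim=1)
        relevance = self.localization_head(fused)   # (B, 1, H, W) logits
        waypoint = self.waypoint_head(fused)        # (B, 2) next navigation goal
        return relevance, waypoint
```

During training, one plausible choice is to supervise `relevance` with a binary cross-entropy loss against coarse masks of the objects mentioned in the instruction, alongside the main navigation loss, so that object localization and map learning reinforce each other.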
Related papers
- IVLMap: Instance-Aware Visual Language Grounding for Consumer Robot Navigation [10.006058028927907]
Vision-and-Language Navigation (VLN) is a challenging task that requires a robot to navigate in photo-realistic environments following human natural language prompts.
Recent studies aim to handle this task by constructing the semantic spatial map representation of the environment.
We propose a new method, namely, Instance-aware Visual Language Map (IVLMap), to empower the robot with instance-level and attribute-level semantic mapping.
arXiv Detail & Related papers (2024-03-28T11:52:42Z)
- Mapping High-level Semantic Regions in Indoor Environments without Object Recognition [50.624970503498226]
The present work proposes a method for semantic region mapping via embodied navigation in indoor environments.
To enable region identification, the method uses a vision-to-language model to provide scene information for mapping.
By projecting egocentric scene understanding into the global frame, the proposed method generates a semantic map as a distribution over possible region labels at each location (a minimal sketch of this projection idea appears after this list).
arXiv Detail & Related papers (2024-03-11T18:09:50Z)
- Object Goal Navigation with Recursive Implicit Maps [92.6347010295396]
We propose an implicit spatial map for object goal navigation.
Our method significantly outperforms the state of the art on the challenging MP3D dataset.
We deploy our model on a real robot and achieve encouraging object goal navigation results in real scenes.
arXiv Detail & Related papers (2023-08-10T14:21:33Z)
- Learning Navigational Visual Representations with Semantic Map Supervision [85.91625020847358]
We propose a navigation-specific visual representation learning method by contrasting the agent's egocentric views and semantic maps.
Ego$^2$-Map learning transfers the compact and rich information from a map, such as objects, structure and transition, to the agent's egocentric representations for navigation.
arXiv Detail & Related papers (2023-07-23T14:01:05Z)
- Structured Exploration Through Instruction Enhancement for Object Navigation [0.0]
We propose a hierarchical learning-based method for object navigation.
The top level is capable of high-level planning and of building a memory at the floorplan level.
We demonstrate the effectiveness of our method on a dynamic domestic environment.
arXiv Detail & Related papers (2022-11-15T19:39:22Z)
- SOON: Scenario Oriented Object Navigation with Graph-based Exploration [102.74649829684617]
The ability to navigate like a human towards a language-guided target from anywhere in a 3D embodied environment is one of the 'holy grail' goals of intelligent robots.
Most visual navigation benchmarks focus on navigating toward a target from a fixed starting point, guided by an elaborate set of instructions that depicts the route step-by-step.
This approach deviates from real-world problems in which a human only describes what the object and its surroundings look like and asks the robot to start navigating from anywhere.
arXiv Detail & Related papers (2021-03-31T15:01:04Z)
- ArraMon: A Joint Navigation-Assembly Instruction Interpretation Task in Dynamic Environments [85.81157224163876]
We combine Vision-and-Language Navigation, assembly of collected objects, and object referring expression comprehension to create a novel joint navigation-and-assembly task, named ArraMon.
During this task, the agent is asked to find and collect different target objects one-by-one by navigating based on natural language instructions in a complex, realistic outdoor environment.
We present results for several baseline models (integrated and biased) and metrics (nDTW, CTC, rPOD, and PTC), and the large model-human performance gap demonstrates that our task is challenging and presents a wide scope for future work.
arXiv Detail & Related papers (2020-11-15T23:30:36Z)
- Extending Maps with Semantic and Contextual Object Information for Robot Navigation: a Learning-Based Framework using Visual and Depth Cues [12.984393386954219]
This paper addresses the problem of building augmented metric representations of scenes with semantic information from RGB-D images.
We propose a complete framework to create an enhanced map representation of the environment with object-level information.
arXiv Detail & Related papers (2020-03-13T15:05:23Z)
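
To make the egocentric-to-global projection described in the "Mapping High-level Semantic Regions" entry above more concrete, here is a minimal NumPy sketch that accumulates per-point region-label probabilities into a global top-down grid and normalizes them into a per-cell label distribution. It is a generic illustration, not that paper's code; the grid size, cell size, pose format, and nearest-cell splatting are assumptions made for brevity.

```python
# Illustrative sketch (not the cited paper's code): project egocentric
# per-point region-label probabilities into a global top-down map and keep a
# distribution over possible region labels at each location. The grid size,
# cell size, pose format, and nearest-cell splatting are assumptions.
import numpy as np

NUM_LABELS = 10      # hypothetical number of region labels
GLOBAL_SIZE = 200    # global map is GLOBAL_SIZE x GLOBAL_SIZE cells
CELL_METERS = 0.25   # metric size of one map cell

def project_egocentric(global_counts, ego_probs, ego_points, pose):
    """Splat egocentric label probabilities into the global evidence map.

    global_counts: (GLOBAL_SIZE, GLOBAL_SIZE, NUM_LABELS) accumulated evidence
    ego_probs:     (N, NUM_LABELS) per-point label probabilities
    ego_points:    (N, 2) point coordinates in the agent frame, in meters
    pose:          (x, y, heading) agent pose in the global frame
    """
    x, y, theta = pose
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    world = ego_points @ rot.T + np.array([x, y])        # agent -> world frame
    cells = np.floor(world / CELL_METERS).astype(int) + GLOBAL_SIZE // 2
    valid = np.all((cells >= 0) & (cells < GLOBAL_SIZE), axis=1)
    for (cx, cy), probs in zip(cells[valid], ego_probs[valid]):
        global_counts[cy, cx] += probs                   # accumulate evidence
    return global_counts

def region_distribution(global_counts, eps=1e-8):
    """Normalize accumulated evidence into a per-cell label distribution."""
    return global_counts / (global_counts.sum(axis=-1, keepdims=True) + eps)
```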
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.