Related papers: Embodied Navigation with Auxiliary Task of Action Description Prediction

URL: http://arxiv.org/abs/2510.21809v1
Date: Tue, 21 Oct 2025 09:12:22 GMT
Title: Embodied Navigation with Auxiliary Task of Action Description Prediction
Authors: Haru Kondoh, Asako Kanezaki
Abstract summary: We propose incorporating the task of describing actions in language into the reinforcement learning of navigation as an auxiliary task. Existing studies have found it difficult to incorporate describing actions into reinforcement learning due to the absence of ground-truth data. We evaluate our approach across various navigation tasks, demonstrating that it can describe actions while attaining high navigation performance.
Score: 6.558761304336893
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The field of multimodal robot navigation in indoor environments has garnered significant attention in recent years. However, as tasks and methods become more advanced, the action decision systems tend to become more complex and operate as black boxes. For a reliable system, the ability to explain or describe its decisions is crucial; however, there tends to be a trade-off in that explainable systems cannot outperform non-explainable systems in terms of performance. In this paper, we propose incorporating the task of describing actions in language into the reinforcement learning of navigation as an auxiliary task. Existing studies have found it difficult to incorporate describing actions into reinforcement learning due to the absence of ground-truth data. We address this issue by leveraging knowledge distillation from pre-trained description generation models, such as vision-language models. We comprehensively evaluate our approach across various navigation tasks, demonstrating that it can describe actions while attaining high navigation performance. Furthermore, it achieves state-of-the-art performance in the particularly challenging multimodal navigation task of semantic audio-visual navigation.

Related papers

SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts [54.11162991206203]
This paper consolidates diverse navigation tasks into a unified and generic framework. We propose a novel State-Adaptive Mixture of Experts (SAME) model that effectively enables an agent to infer decisions.
arXiv Detail & Related papers (2024-12-07T06:12:53Z)
Hierarchical end-to-end autonomous navigation through few-shot waypoint detection [0.0]
Human navigation is facilitated through the association of actions with landmarks. Current autonomous navigation schemes rely on accurate positioning devices and algorithms as well as extensive streams of sensory data collected from the environment. We propose a hierarchical end-to-end meta-learning scheme that enables a mobile robot to navigate in a previously unknown environment.
arXiv Detail & Related papers (2024-09-23T00:03:39Z)
DISCO: Embodied Navigation and Interaction via Differentiable Scene Semantics and Dual-level Control [53.80518003412016]
Building a general-purpose intelligent home-assistant agent skilled in diverse tasks by human commands is a long-term blueprint of embodied AI research. We study primitive mobile manipulations for embodied agents, i.e. how to navigate and interact based on an instructed verb-noun pair. We propose DISCO, which features non-trivial advancements in contextualized scene modeling and efficient controls.
arXiv Detail & Related papers (2024-07-20T05:39:28Z)
TINA: Think, Interaction, and Action Framework for Zero-Shot Vision Language Navigation [11.591176410027224]
This paper presents a Vision-Language Navigation (VLN) agent based on Large Language Models (LLMs). We propose the Thinking, Interacting, and Action framework to compensate for the shortcomings of LLMs in environmental perception. Our approach also outperformed some supervised learning-based methods, highlighting its efficacy in zero-shot navigation.
arXiv Detail & Related papers (2024-03-13T05:22:39Z)
Towards Versatile Embodied Navigation [120.73460380993305]
Vienna is a versatile embodied navigation agent that simultaneously learns to perform the four navigation tasks with one model. We empirically demonstrate that, compared with learning each visual navigation task individually, our agent achieves comparable or even better performance with reduced complexity.
arXiv Detail & Related papers (2022-10-30T11:53:49Z)
Adversarial Reinforced Instruction Attacker for Robust Vision-Language Navigation [145.84123197129298]
Language instruction plays an essential role in natural language grounded navigation tasks. We aim to train a more robust navigator that is capable of dynamically extracting crucial factors from long instructions. Specifically, we propose a Dynamic Reinforced Instruction Attacker (DR-Attacker), which learns to mislead the navigator to move to the wrong target.
arXiv Detail & Related papers (2021-07-23T14:11:31Z)
Deep Learning for Embodied Vision Navigation: A Survey [108.13766213265069]
The "embodied visual navigation" problem requires an agent to navigate in a 3D environment relying mainly on its first-person observation. This paper attempts to establish an outline of the current works in the field of embodied visual navigation by providing a comprehensive literature survey.
arXiv Detail & Related papers (2021-07-07T12:09:04Z)
Building Intelligent Autonomous Navigation Agents [18.310643564200525]
The goal of this thesis is to make progress towards designing algorithms capable of 'physical intelligence'. In the first part of the thesis, we discuss our work on short-term navigation using end-to-end reinforcement learning. In the second part, we present a new class of navigation methods based on modular learning and structured explicit map representations.
arXiv Detail & Related papers (2021-06-25T04:10:58Z)
Active Visual Information Gathering for Vision-Language Navigation [115.40768457718325]
Vision-language navigation (VLN) is the task of requiring an agent to carry out navigational instructions inside photo-realistic environments. One of the key challenges in VLN is how to conduct robust navigation by mitigating the uncertainty caused by ambiguous instructions and insufficient observation of the environment. This work draws inspiration from human navigation behavior and endows an agent with an active information gathering ability for a more intelligent VLN policy.
arXiv Detail & Related papers (2020-07-15T23:54:20Z)

This list is automatically generated from the titles and abstracts of the papers on this site.