Related papers: Vision-Language Navigation with Energy-Based Policy

URL: http://arxiv.org/abs/2410.14250v1
Date: Fri, 18 Oct 2024 08:01:36 GMT
Title: Vision-Language Navigation with Energy-Based Policy
Authors: Rui Liu, Wenguan Wang, Yi Yang
Abstract summary: Vision-language navigation (VLN) requires an agent to execute actions following human instructions. We propose an Energy-based Navigation Policy (ENP) to model the joint state-action distribution. ENP achieves promising performances on R2R, REVERIE, RxR, and R2R-CE.
Score: 66.04379819772764
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-language navigation (VLN) requires an agent to execute actions following human instructions. Existing VLN models are optimized through expert demonstrations by supervised behavioural cloning or incorporating manual reward engineering. While straightforward, these efforts overlook the accumulation of errors in the Markov decision process, and struggle to match the distribution of the expert policy. Going beyond this, we propose an Energy-based Navigation Policy (ENP) to model the joint state-action distribution using an energy-based model. At each step, low energy values correspond to the state-action pairs that the expert is most likely to perform, and vice versa. Theoretically, the optimization objective is equivalent to minimizing the forward divergence between the occupancy measure of the expert and ours. Consequently, ENP learns to globally align with the expert policy by maximizing the likelihood of the actions and modeling the dynamics of the navigation states in a collaborative manner. With a variety of VLN architectures, ENP achieves promising performances on R2R, REVERIE, RxR, and R2R-CE, unleashing the power of existing VLN models.
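
The abstract's statement that low energy corresponds to the actions the expert is most likely to take can be illustrated with a minimal sketch. It assumes a conditional Boltzmann form p(a | s) ∝ exp(-E(s, a)) over a discrete set of candidate actions. The function names and the energy values are hypothetical, and this is not the authors' implementation: the paper models the joint state-action distribution, which this sketch does not cover.

```python
import math


def action_probabilities(energies):
    """Map per-action energies to probabilities, p(a|s) proportional to exp(-E(s, a)).

    Assumption: a discrete candidate-action set and a Boltzmann form.
    """
    # Shift by the minimum energy so the exponentials stay numerically stable.
    lowest = min(energies)
    weights = [math.exp(-(e - lowest)) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]


def negative_log_likelihood(energies, expert_index):
    """Loss that is small when the expert's action has the lowest energy."""
    probs = action_probabilities(energies)
    return -math.log(probs[expert_index])


# Hypothetical energies for three candidate actions at one navigation step.
energies = [2.0, 0.5, 3.0]
probs = action_probabilities(energies)

# The lowest-energy action is the most probable one.
best = min(range(len(energies)), key=energies.__getitem__)

print("probabilities:", [round(p, 3) for p in probs])
print("most likely action index:", best)
print("NLL if expert chose action 1:", round(negative_log_likelihood(energies, 1), 3))
```

Under these assumptions, maximizing the likelihood of the expert's action pushes its energy down relative to the alternatives, which covers only the action-likelihood part of the objective the abstract describes.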

Related papers

ORPR: An OR-Guided Pretrain-then-Reinforce Learning Model for Inventory Management [9.138155308817215]
"Pretrain-then-Reinforce" approach reconciles AI's adaptive perception with Operations Research's structural rigor. We show that a lightweight, domain-informed model can deliver state-of-the-art performance and robust transferability when guided by structured OR logic.
arXiv Detail & Related papers (2025-12-22T03:39:43Z)
ReLaX: Reasoning with Latent Exploration for Large Reasoning Models [11.506415241741601]
We argue that latent dynamics underlying token generation encode a far richer computational structure for steering policy optimization. We propose Reasoning with Latent eXploration (ReLaX), a paradigm that explicitly incorporates latent dynamics to regulate exploration and exploitation.
arXiv Detail & Related papers (2025-12-08T13:48:33Z)
Run, Ruminate, and Regulate: A Dual-process Thinking System for Vision-and-Language Navigation [52.11339614452127]
Vision-and-Language Navigation (VLN) requires an agent to dynamically explore complex 3D environments following human instructions. Recent research underscores the potential of harnessing large language models (LLMs) for VLN, given their commonsense knowledge and general reasoning capabilities. We propose a novel dual-process thinking framework dubbed R3, integrating LLMs' generalization capabilities with VLN-specific expertise in a zero-shot manner.
arXiv Detail & Related papers (2025-11-18T04:32:00Z)
Exploration from a Primal-Dual Lens: Value-Incentivized Actor-Critic Methods for Sample-Efficient Online RL [40.05960121330012]
Online reinforcement learning (RL) with complex function approximations plays a significant role in the modern practice of artificial intelligence. Balancing the fundamental trade-off between exploration and exploitation remains a long-standing challenge. This paper provides an interpretation of the principle of optimism through the lens of primal-dual optimization.
arXiv Detail & Related papers (2025-06-27T17:18:43Z)
VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning [77.34267241692706]
Vision-Language Navigation (VLN) is a core challenge in embodied AI, requiring agents to navigate real-world environments using natural language instructions. We propose VLN-R1, an end-to-end framework that leverages Large Vision-Language Models (LVLM) to directly translate egocentric video streams into continuous navigation actions.
arXiv Detail & Related papers (2025-06-20T17:59:59Z)
Policy Regularization on Globally Accessible States in Cross-Dynamics Reinforcement Learning [53.9544543607396]
We propose a novel framework that integrates reward rendering with Imitation from Observation (IfO). By instantiating F-distance in different ways, we derive two theoretical analyses and develop a practical algorithm called Accessible State Oriented Policy Regularization (ASOR). ASOR serves as a general add-on module that can be incorporated into various RL approaches, including offline RL and off-policy RL.
arXiv Detail & Related papers (2025-03-10T03:50:20Z)
Operator World Models for Reinforcement Learning [37.69110422996011]
Policy Mirror Descent (PMD) is a powerful and theoretically sound methodology for sequential decision-making. It is not directly applicable to Reinforcement Learning (RL) due to the inaccessibility of explicit action-value functions. We introduce a novel approach based on learning a world model of the environment using conditional mean embeddings.
arXiv Detail & Related papers (2024-06-28T12:05:47Z)
Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF [82.7679132059169]
Reinforcement learning from human feedback has emerged as a central tool for language model alignment. We propose a new algorithm for online exploration in RLHF, Exploratory Preference Optimization (XPO). XPO enjoys the strongest known provable guarantees and promising empirical performance.
arXiv Detail & Related papers (2024-05-31T17:39:06Z)
NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning [101.56342075720588]
Vision-and-Language Navigation (VLN), as a crucial research problem of Embodied AI, requires an embodied agent to navigate through complex 3D environments following natural language instructions. Recent research has highlighted the promising capacity of large language models (LLMs) in VLN by improving navigational reasoning accuracy and interpretability. This paper introduces a novel strategy called Navigational Chain-of-Thought (NavCoT), where we fulfill parameter-efficient in-domain training to enable self-guided navigational decision.
arXiv Detail & Related papers (2024-03-12T07:27:02Z)
Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks. We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level. We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
arXiv Detail & Related papers (2024-02-09T07:45:26Z)
Beyond Reverse KL: Generalizing Direct Preference Optimization with Diverse Divergence Constraints [26.274786600234876]
The increasing capabilities of large language models (LLMs) raise opportunities for artificial general intelligence but amplify safety concerns. RLHF has emerged as a promising pathway towards AI alignment but brings forth challenges due to its complexity and dependence on a separate reward model. DPO has been proposed as an alternative, and it remains equivalent to RLHF under the reverse KL regularization constraint. We show that under certain $f$-divergences, including Jensen-Shannon divergence, forward KL divergences and $\alpha$-divergences, the complex relationship between the reward and optimal policy can also be simplified.
arXiv Detail & Related papers (2023-09-28T08:29:44Z)
When Demonstrations Meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent. Accurate models of expertise in executing a task have applications in safety-sensitive settings such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z)
Soft Expert Reward Learning for Vision-and-Language Navigation [94.86954695912125]
Vision-and-Language Navigation (VLN) requires an agent to find a specified spot in an unseen environment by following natural language instructions. We introduce a Soft Expert Reward Learning (SERL) model to overcome the reward engineering and generalisation problems of the VLN task.
arXiv Detail & Related papers (2020-07-21T14:17:36Z)

This list is automatically generated from the titles and abstracts of the papers on this site.