Guided Cooperation in Hierarchical Reinforcement Learning via Model-based Rollout
- URL: http://arxiv.org/abs/2309.13508v2
- Date: Sat, 6 Apr 2024 17:07:13 GMT
- Title: Guided Cooperation in Hierarchical Reinforcement Learning via Model-based Rollout
- Authors: Haoran Wang, Zeshen Tang, Leya Yang, Yaoru Sun, Fang Wang, Siyu Zhang, Yeming Chen
- Abstract summary: We propose a goal-conditioned hierarchical reinforcement learning (HRL) framework named Guided Cooperation via Model-based Rollout (GCMR).
GCMR aims to bridge inter-layer information synchronization and cooperation by exploiting forward dynamics.
Experimental results demonstrate that incorporating the proposed GCMR framework with a disentangled variant of HIGL, namely ACLG, yields more stable and robust policy improvement.
- Score: 16.454305212398328
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Goal-conditioned hierarchical reinforcement learning (HRL) presents a promising approach for enabling effective exploration in complex, long-horizon reinforcement learning (RL) tasks through temporal abstraction. Empirically, heightened inter-level communication and coordination can induce more stable and robust policy improvement in hierarchical systems. Yet, most existing goal-conditioned HRL algorithms have primarily focused on the subgoal discovery, neglecting inter-level cooperation. Here, we propose a goal-conditioned HRL framework named Guided Cooperation via Model-based Rollout (GCMR), aiming to bridge inter-layer information synchronization and cooperation by exploiting forward dynamics. Firstly, the GCMR mitigates the state-transition error within off-policy correction via model-based rollout, thereby enhancing sample efficiency. Secondly, to prevent disruption by the unseen subgoals and states, lower-level Q-function gradients are constrained using a gradient penalty with a model-inferred upper bound, leading to a more stable behavioral policy conducive to effective exploration. Thirdly, we propose a one-step rollout-based planning, using higher-level critics to guide the lower-level policy. Specifically, we estimate the value of future states of the lower-level policy using the higher-level critic function, thereby transmitting global task information downwards to avoid local pitfalls. These three critical components in GCMR are expected to facilitate inter-level cooperation significantly. Experimental results demonstrate that incorporating the proposed GCMR framework with a disentangled variant of HIGL, namely ACLG, yields more stable and robust policy improvement compared to various baselines and significantly outperforms previous state-of-the-art algorithms.
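The third component (one-step rollout-based planning) lends itself to a compact illustration: the lower-level action is rolled through a learned forward dynamics model, and the higher-level critic scores the predicted next state so that global task information guides the lower-level policy. The following is a minimal, hypothetical sketch of that idea in PyTorch, not the authors' released code; the module names (`dynamics`, `low_actor`, `high_critic`), dimensions, and loss form are illustrative assumptions.

```python
# Hypothetical sketch of one-step rollout-based guidance (assumed names/shapes,
# not the GCMR reference implementation).
import torch
import torch.nn as nn

STATE_DIM, GOAL_DIM, ACTION_DIM = 10, 3, 4

def mlp(in_dim, out_dim, hidden=64):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

dynamics = mlp(STATE_DIM + ACTION_DIM, STATE_DIM)   # learned forward model f(s, a) -> s'
low_actor = mlp(STATE_DIM + GOAL_DIM, ACTION_DIM)   # lower-level policy pi_lo(s, subgoal)
high_critic = mlp(STATE_DIM + GOAL_DIM, 1)          # higher-level critic V_hi(s, task goal)

def guidance_loss(state, subgoal, task_goal):
    """One-step rollout: predict the next state under the lower-level action,
    then ask the higher-level critic how valuable that predicted state is."""
    action = torch.tanh(low_actor(torch.cat([state, subgoal], dim=-1)))
    next_state = dynamics(torch.cat([state, action], dim=-1))
    # Encourage the lower-level policy to reach states the higher-level critic values.
    return -high_critic(torch.cat([next_state, task_goal], dim=-1)).mean()

# Usage: add this term to the usual lower-level actor loss before the optimizer step.
states = torch.randn(32, STATE_DIM)
subgoals = torch.randn(32, GOAL_DIM)
task_goals = torch.randn(32, GOAL_DIM)
loss = guidance_loss(states, subgoals, task_goals)
loss.backward()
```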
Related papers
- Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z) - Bidirectional-Reachable Hierarchical Reinforcement Learning with Mutually Responsive Policies [26.915223518488016]
Hierarchical reinforcement learning (HRL) addresses complex long-horizon tasks by decomposing them into subgoals.
We propose the Bidirectional-reachable Hierarchical Policy Optimization (BrHPO), a simple yet effective algorithm that also enjoys computation efficiency.
Experiment results on a variety of long-horizon tasks showcase that BrHPO outperforms other state-of-the-art HRL baselines, coupled with a significantly higher exploration efficiency and robustness.
arXiv Detail & Related papers (2024-06-26T04:05:04Z) - Joint Demonstration and Preference Learning Improves Policy Alignment with Human Feedback [58.049113055986375]
We develop a single-stage approach named Alignment with Integrated Human Feedback (AIHF) to train reward models and the policy.
The proposed approach admits a suite of efficient algorithms, which can easily reduce to, and leverage, popular alignment algorithms.
We demonstrate the efficiency of the proposed solutions with extensive experiments involving alignment problems in LLMs and robotic control problems in MuJoCo.
arXiv Detail & Related papers (2024-06-11T01:20:53Z) - Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks.
We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level.
We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
arXiv Detail & Related papers (2024-02-09T07:45:26Z) - Rethinking Decision Transformer via Hierarchical Reinforcement Learning [54.3596066989024]
Decision Transformer (DT) is an innovative algorithm leveraging recent advances of the transformer architecture in reinforcement learning (RL).
We introduce a general sequence modeling framework for studying sequential decision making through the lens of Hierarchical RL.
We show DT emerges as a special case of this framework with certain choices of high-level and low-level policies, and discuss the potential failure of these choices.
arXiv Detail & Related papers (2023-11-01T03:32:13Z) - Adversarially Guided Subgoal Generation for Hierarchical Reinforcement Learning [5.514236598436977]
We propose a novel HRL approach for mitigating the non-stationarity by adversarially enforcing the high-level policy to generate subgoals compatible with the current instantiation of the low-level policy.
Experiments with state-of-the-art algorithms show that our approach significantly improves learning efficiency and overall performance of HRL in various challenging continuous control tasks.
arXiv Detail & Related papers (2022-01-24T12:30:38Z) - Adjacency constraint for efficient hierarchical reinforcement learning [25.15808501708926]
Goal-conditioned Hierarchical Reinforcement Learning (HRL) is a promising approach for scaling up reinforcement learning (RL) techniques.
HRL often suffers from training inefficiency because the action space of the high-level policy, i.e., the goal space, is large.
We show that this problem can be effectively alleviated by restricting the high-level action space to a $k$-step adjacent region of the current state.
arXiv Detail & Related papers (2021-10-30T09:26:45Z) - Provable Hierarchy-Based Meta-Reinforcement Learning [50.17896588738377]
We analyze HRL in the meta-RL setting, where the learner learns a latent hierarchical structure during meta-training for use in a downstream task.
We provide "diversity conditions" which, together with a tractable optimism-based algorithm, guarantee sample-efficient recovery of this natural hierarchy.
Our bounds incorporate common notions in HRL literature such as temporal and state/action abstractions, suggesting that our setting and analysis capture important features of HRL in practice.
arXiv Detail & Related papers (2021-10-18T17:56:02Z) - Hierarchical Reinforcement Learning with Optimal Level Synchronization based on a Deep Generative Model [4.266866385061998]
One open issue in HRL is how to train the policy at each level with optimal data collection from its own experience.
We propose a novel HRL model supporting the optimal level synchronization using the off-policy correction technique with a deep generative model.
arXiv Detail & Related papers (2021-07-17T05:02:25Z) - The Gradient Convergence Bound of Federated Multi-Agent Reinforcement Learning with Efficient Communication [20.891460617583302]
This paper considers independent reinforcement learning (IRL) for collaborative decision-making in the federated learning (FL) paradigm.
FL generates excessive communication overheads between agents and a remote central server.
This paper proposes two advanced optimization schemes to improve the system's utility value.
arXiv Detail & Related papers (2021-03-24T07:21:43Z) - Optimization-driven Deep Reinforcement Learning for Robust Beamforming in IRS-assisted Wireless Communications [54.610318402371185]
Intelligent reflecting surface (IRS) is a promising technology to assist downlink information transmissions from a multi-antenna access point (AP) to a receiver.
We minimize the AP's transmit power by a joint optimization of the AP's active beamforming and the IRS's passive beamforming.
We propose a deep reinforcement learning (DRL) approach that can adapt the beamforming strategies from past experiences.
arXiv Detail & Related papers (2020-05-25T01:42:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.