Hindsight Expectation Maximization for Goal-conditioned Reinforcement Learning
- URL: http://arxiv.org/abs/2006.07549v2
- Date: Fri, 26 Feb 2021 15:37:26 GMT
- Title: Hindsight Expectation Maximization for Goal-conditioned Reinforcement Learning
- Authors: Yunhao Tang, Alp Kucukelbir
- Abstract summary: We propose a graphical model framework for goal-conditioned RL, with an EM algorithm that operates on the lower bound of the RL objective.
The E-step provides a natural interpretation of how 'learning in hindsight' techniques, such as HER, handle extremely sparse goal-conditioned rewards.
The M-step reduces policy optimization to supervised learning updates, which greatly stabilizes end-to-end training on high-dimensional inputs such as images.
- Score: 26.631740480100724
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a graphical model framework for goal-conditioned RL, with an EM
algorithm that operates on the lower bound of the RL objective. The E-step
provides a natural interpretation of how 'learning in hindsight' techniques,
such as HER, handle extremely sparse goal-conditioned rewards. The M-step
reduces policy optimization to supervised learning updates, which greatly
stabilizes end-to-end training on high-dimensional inputs such as images. We
show that the combined algorithm, hEM, significantly outperforms model-free
baselines on a wide range of goal-conditioned benchmarks with sparse rewards.
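As a rough illustration of the E-step/M-step structure described in the abstract, here is a minimal sketch of a hindsight EM training loop. The environment and policy interfaces (`env.sample_goal`, `env.achieved_goal`, `policy.log_prob`, etc.) are hypothetical placeholders, not the authors' implementation.

```python
# Minimal sketch of a hindsight EM loop for goal-conditioned RL.
# All interfaces below (env, policy, optimizer) are hypothetical placeholders.
import random

def hindsight_em(env, policy, optimizer, iterations=1000, batch_size=64):
    buffer = []  # (state, action, relabeled goal) tuples

    for _ in range(iterations):
        # Collect a trajectory toward a commanded goal.
        goal, state, done = env.sample_goal(), env.reset(), False
        trajectory = []
        while not done:
            action = policy.act(state, goal)
            next_state, done = env.step(action)
            trajectory.append((state, action))
            state = next_state

        # E-step (hindsight relabeling, as in HER): pretend a goal that was
        # actually achieved later in the trajectory had been the commanded
        # goal, so the sparse reward signal no longer vanishes.
        for t, (s, a) in enumerate(trajectory):
            future_state = trajectory[random.randint(t, len(trajectory) - 1)][0]
            buffer.append((s, a, env.achieved_goal(future_state)))

        # M-step: policy optimization reduces to supervised learning, i.e.
        # maximize the log-likelihood of stored actions given relabeled goals.
        batch = random.sample(buffer, min(batch_size, len(buffer)))
        loss = -sum(policy.log_prob(s, a, g) for s, a, g in batch) / len(batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

The point the abstract makes is visible in this sketch: the only RL-specific machinery is the hindsight relabeling in the E-step, while the M-step update is an ordinary supervised loss.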
Related papers
- Beyond Linear Approximations: A Novel Pruning Approach for Attention Matrix [17.086679273053853]
Large Language Models (LLMs) have shown immense potential in enhancing various aspects of our daily lives.
Their growing capabilities come at the cost of extremely large model sizes, making deployment on edge devices challenging.
This paper introduces a novel approach to LLM weight pruning that directly optimizes for approximating the attention matrix.
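For intuition only, pruning "directly for the attention matrix" can be caricatured as scoring each weight by how much its removal perturbs the attention map rather than by its magnitude. The one-shot greedy procedure below is an assumed toy version, not the paper's algorithm.

```python
# Toy illustration: prune query-projection weights by their effect on the
# attention matrix instead of their magnitude. Not the paper's algorithm.
import numpy as np

def attention(X, Wq, Wk):
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(Wq.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerically stable softmax
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def prune_for_attention(X, Wq, Wk, sparsity=0.5):
    A_ref, W = attention(X, Wq, Wk), Wq.copy()
    entries = [(i, j) for i in range(W.shape[0]) for j in range(W.shape[1])]
    impact = []
    for i, j in entries:
        saved, W[i, j] = W[i, j], 0.0
        impact.append(np.linalg.norm(attention(X, W, Wk) - A_ref))
        W[i, j] = saved  # restore before scoring the next entry
    # Zero the entries whose removal perturbs the attention matrix the least.
    for k in np.argsort(impact)[: int(len(entries) * sparsity)]:
        i, j = entries[k]
        W[i, j] = 0.0
    return W
```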
arXiv Detail & Related papers (2024-10-15T04:35:56Z)
- Attribute Controlled Fine-tuning for Large Language Models: A Case Study on Detoxification [76.14641982122696]
We propose a constraint learning schema for fine-tuning Large Language Models (LLMs) with attribute control.
We show that our approach leads to an LLM that produces fewer inappropriate responses while achieving competitive performance on benchmarks and a toxicity detection task.
arXiv Detail & Related papers (2024-10-07T23:38:58Z)
- Directed Exploration in Reinforcement Learning from Linear Temporal Logic [59.707408697394534]
Linear temporal logic (LTL) is a powerful language for task specification in reinforcement learning.
We show that the synthesized reward signal remains fundamentally sparse, making exploration challenging.
We show how better exploration can be achieved by further leveraging the specification and casting its corresponding Limit Deterministic Büchi Automaton (LDBA) as a Markov reward process.
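As a rough sketch of the product construction this alludes to (the LDBA interface and the distance-based bonus below are assumptions for illustration; the paper derives its exploration signal by treating the automaton as a Markov reward process):

```python
# Sketch: run the environment in lockstep with the specification's LDBA and
# expose the automaton state, plus a progress bonus, to the learner.
# `ldba.*` and the env interface are hypothetical placeholders.
class LDBAProductEnv:
    def __init__(self, env, ldba, bonus_scale=0.1):
        self.env, self.ldba, self.bonus_scale = env, ldba, bonus_scale

    def reset(self):
        self.q = self.ldba.initial_state()
        return self.env.reset(), self.q

    def step(self, action):
        state, done = self.env.step(action)
        prev_dist = self.ldba.distance_to_accepting(self.q)
        self.q = self.ldba.step(self.q, self.env.label(state))  # labeled transition
        # Sparse reward only when the automaton reaches an accepting state ...
        reward = 1.0 if self.ldba.is_accepting(self.q) else 0.0
        # ... plus a small dense bonus for moving closer to acceptance,
        # a crude stand-in for the value computed on the automaton.
        reward += self.bonus_scale * (prev_dist - self.ldba.distance_to_accepting(self.q))
        return (state, self.q), reward, done
```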
arXiv Detail & Related papers (2024-08-18T14:25:44Z)
- Unleashing the Power of Meta-tuning for Few-shot Generalization Through Sparse Interpolated Experts [33.58165081033569]
We introduce Sparse MetA-Tuning (SMAT), a method inspired by sparse mixture-of-experts approaches.
SMAT successfully overcomes OOD sensitivity and delivers on the promise of enhancing the transfer abilities of vision foundation models.
arXiv Detail & Related papers (2024-03-13T12:46:03Z)
- Let's reward step by step: Step-Level reward model as the Navigators for Reasoning [64.27898739929734]
Process-Supervised Reward Model (PRM) furnishes LLMs with step-by-step feedback during the training phase.
We propose a greedy search algorithm that employs the step-level feedback from PRM to optimize the reasoning pathways explored by LLMs.
To explore the versatility of our approach, we develop a novel method to automatically generate a step-level reward dataset for coding tasks and observe similar performance improvements on code generation tasks.
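A minimal sketch of the greedy, PRM-guided search described here (the `llm` and `prm` interfaces are hypothetical placeholders):

```python
# Sketch: pick each next reasoning step greedily by the step-level reward
# a process-supervised reward model (PRM) assigns to the partial solution.
def greedy_prm_search(llm, prm, question, max_steps=8, n_candidates=4):
    steps = []
    for _ in range(max_steps):
        # Sample several candidate continuations from the LLM.
        proposals = llm.propose_steps(question, steps, n=n_candidates)
        # Keep the candidate the step-level reward model scores highest.
        best = max(proposals, key=lambda step: prm.score(question, steps + [step]))
        steps.append(best)
        if llm.is_final_answer(best):
            break
    return steps
```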
arXiv Detail & Related papers (2023-10-16T05:21:50Z)
- When Demonstrations Meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent.
Accurate models of expertise in executing a task have applications in safety-sensitive domains such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z)
- Simplifying Model-based RL: Learning Representations, Latent-space Models, and Policies with One Objective [142.36200080384145]
We propose a single objective which jointly optimizes a latent-space model and policy to achieve high returns while remaining self-consistent.
We demonstrate that the resulting algorithm matches or improves the sample-efficiency of the best prior model-based and model-free RL methods.
arXiv Detail & Related papers (2022-09-18T03:51:58Z)
- Value Gradient weighted Model-Based Reinforcement Learning [28.366157882991565]
Model-based reinforcement learning (MBRL) is a sample-efficient technique for obtaining control policies.
VaGraM is a novel method for value-aware model learning.
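As a hedged illustration of value-aware model learning in the spirit of VaGraM (the exact objective below is an assumption, not the paper's formula): the model's one-step error is measured along the directions the value function is sensitive to, rather than with a plain MSE.

```python
# Sketch: weight one-step model errors by the gradient of the value function,
# so the model spends capacity where errors actually affect value estimates.
import torch

def value_gradient_weighted_loss(model, value_fn, states, actions, next_states):
    pred = model(states, actions)                        # predicted next states
    s_next = next_states.clone().requires_grad_(True)
    grads = torch.autograd.grad(value_fn(s_next).sum(), s_next)[0]  # dV/ds'
    # Inner product of the value gradient with the prediction error, squared:
    # errors orthogonal to the value gradient are (approximately) ignored.
    return ((grads.detach() * (pred - next_states)).sum(dim=-1) ** 2).mean()
```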
arXiv Detail & Related papers (2022-04-04T13:28:31Z)
- Model-Advantage Optimization for Model-Based Reinforcement Learning [41.13567626667456]
Model-based Reinforcement Learning (MBRL) algorithms have been traditionally designed with the goal of learning accurate dynamics of the environment.
Value-aware model learning, an alternative model-learning paradigm to maximum likelihood, proposes to inform model-learning through the value function of the learnt policy.
We propose a novel value-aware objective that is an upper bound on the absolute performance difference of a policy across two models.
arXiv Detail & Related papers (2021-06-26T20:01:28Z)
- Reparameterized Variational Divergence Minimization for Stable Imitation [57.06909373038396]
We study the extent to which variations in the choice of probabilistic divergence may yield more performant ILO algorithms.
We contribute a reparameterization trick for adversarial imitation learning to alleviate the challenges of the promising $f$-divergence minimization framework.
Empirically, we demonstrate that our design choices allow for ILO algorithms that outperform baseline approaches and more closely match expert performance in low-dimensional continuous-control tasks.
arXiv Detail & Related papers (2020-06-18T19:04:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.