Model Based Meta Learning of Critics for Policy Gradients
- URL: http://arxiv.org/abs/2204.02210v1
- Date: Tue, 5 Apr 2022 13:43:12 GMT
- Title: Model Based Meta Learning of Critics for Policy Gradients
- Authors: Sarah Bechtle, Ludovic Righetti, Franziska Meier
- Abstract summary: We present a framework to meta-learn the critic for gradient-based policy learning.
Our algorithm leads to learned critics that resemble the ground truth Q function for a given task.
After meta-training, the learned critic can be used to learn new policies for new, unseen tasks and environment settings.
- Score: 19.431964785397717
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Being able to seamlessly generalize across different tasks is fundamental for
robots to act in our world. However, learning representations that generalize
quickly to new scenarios is still an open research problem in reinforcement
learning. In this paper we present a framework to meta-learn the critic for
gradient-based policy learning. Concretely, we propose a model-based bi-level
optimization algorithm that updates the critic's parameters such that the policy
that is learned with the updated critic gets closer to solving the
meta-training tasks. We illustrate that our algorithm leads to learned critics
that resemble the ground truth Q function for a given task. Finally, after
meta-training, the learned critic can be used to learn new policies for new,
unseen tasks and environment settings via model-free policy gradient
optimization, without requiring a model. We present results that show the
generalization capabilities of our learned critic to new tasks and dynamics
when used to learn a new policy in a new scenario.
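To make the bi-level structure concrete, here is a hypothetical 1-D toy sketch (not the paper's implementation): the policy is a single parameter updated by ascending a learned quadratic critic, and the outer loop adjusts the critic so that the policy produced by the inner update lands closer to the (outer-loss-only) task optimum. All names, forms, and learning rates are illustrative assumptions.

```python
# Toy sketch of bi-level critic meta-learning (hypothetical 1-D setup).
# Inner level: a policy parameter theta ascends the learned critic
# Q_w(theta) = -(theta - w)^2, which pulls theta toward w.
# Outer level: the critic parameter w is updated so that the *updated*
# policy gets closer to the task optimum theta_star, differentiating
# through the inner step (the bi-level part).

ALPHA = 0.3        # inner (policy) step size
BETA = 0.1         # outer (critic) step size
THETA_STAR = 2.0   # task optimum, visible only to the outer loss

def inner_step(theta, w):
    """One policy step ascending Q_w: theta' = theta - ALPHA * dQ/dtheta."""
    return theta - ALPHA * 2.0 * (theta - w)

def meta_train(steps=200):
    w = 0.0
    for _ in range(steps):
        theta = 0.0                       # fresh policy per meta-iteration
        theta_new = inner_step(theta, w)
        # Outer loss L = (theta_new - THETA_STAR)^2.
        # Chain rule through the inner step: d(theta_new)/dw = 2 * ALPHA.
        grad_w = 2.0 * (theta_new - THETA_STAR) * 2.0 * ALPHA
        w -= BETA * grad_w
    return w

w = meta_train()
theta_final = inner_step(0.0, w)          # policy learned with the meta-trained critic
print(round(theta_final, 3))              # → 2.0
```

The meta-trained critic ends up with its maximizer offset from the task optimum exactly enough that a single inner step from scratch solves the task, which is the toy analogue of "the policy learned with the updated critic gets closer to solving the meta-training tasks."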
Related papers
- Robust Finetuning of Vision-Language-Action Robot Policies via Parameter Merging [53.41119829581115]
Generalist robot policies, trained on large and diverse datasets, have demonstrated the ability to generalize.
They still fall short on new tasks not covered in the training data.
We develop a method that preserves the generalization capabilities of the generalist policy during finetuning.
arXiv Detail & Related papers (2025-12-09T08:02:11Z) - Functional Critic Modeling for Provably Convergent Off-Policy Actor-Critic [29.711769434073755]
We introduce a novel concept of functional critic modeling, which leads to a new AC framework.
We provide a theoretical analysis in the linear function setting, establishing the provable convergence of our framework.
arXiv Detail & Related papers (2025-09-26T21:55:26Z) - Probabilistic Curriculum Learning for Goal-Based Reinforcement Learning [2.5352713493505785]
Reinforcement learning -- algorithms that teach artificial agents to interact with environments by maximising reward signals -- has achieved significant success in recent years.
One promising research direction involves introducing goals to allow multimodal policies, commonly through hierarchical or curriculum reinforcement learning.
We present a novel probabilistic curriculum learning algorithm to suggest goals for reinforcement learning agents in continuous control and navigation tasks.
arXiv Detail & Related papers (2025-04-02T08:15:16Z) - Residual Q-Learning: Offline and Online Policy Customization without
Value [53.47311900133564]
Imitation Learning (IL) is a widely used framework for learning imitative behavior from demonstrations.
We formulate a new problem setting called policy customization.
We propose a novel framework, Residual Q-learning, which can solve the formulated MDP by leveraging the prior policy.
arXiv Detail & Related papers (2023-06-15T22:01:19Z) - Predictive Experience Replay for Continual Visual Control and
Forecasting [62.06183102362871]
We present a new continual learning approach for visual dynamics modeling and explore its efficacy in visual control and forecasting.
We first propose the mixture world model that learns task-specific dynamics priors with a mixture of Gaussians, and then introduce a new training strategy to overcome catastrophic forgetting.
Our model remarkably outperforms the naive combinations of existing continual learning and visual RL algorithms on DeepMind Control and Meta-World benchmarks with continual visual control tasks.
arXiv Detail & Related papers (2023-03-12T05:08:03Z) - On the Effectiveness of Fine-tuning Versus Meta-reinforcement Learning [71.55412580325743]
We show that multi-task pretraining with fine-tuning on new tasks performs equally as well, or better, than meta-pretraining with meta test-time adaptation.
This is encouraging for future research, as multi-task pretraining tends to be simpler and computationally cheaper than meta-RL.
arXiv Detail & Related papers (2022-06-07T13:24:00Z) - Dynamic Regret Analysis for Online Meta-Learning [0.0]
The online meta-learning framework has arisen as a powerful tool for the continual lifelong learning setting.
This formulation involves two levels: an outer level that learns the meta-learner and an inner level that learns task-specific models.
We establish performance guarantees in terms of dynamic regret, which handles changing environments from a global perspective.
Our analyses prove, in expectation, a logarithmic local dynamic regret that explicitly depends on the total number of iterations.
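For context, dynamic regret in its standard form compares the learner's cumulative loss against a time-varying comparator sequence rather than a single fixed comparator (this is the textbook definition, not necessarily the exact quantity analyzed in that paper):

```latex
\mathrm{Reg}_D(T) \;=\; \sum_{t=1}^{T} f_t(x_t) \;-\; \sum_{t=1}^{T} f_t(x_t^{*}),
\qquad x_t^{*} \in \arg\min_{x} f_t(x),
```

where $f_t$ is the loss at round $t$ and $x_t$ the learner's iterate; allowing the comparator $x_t^{*}$ to change each round is what makes the notion suitable for changing environments.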
arXiv Detail & Related papers (2021-09-29T12:12:59Z) - Model-based Meta Reinforcement Learning using Graph Structured Surrogate
Models [40.08137765886609]
We show that our model, called a graph structured surrogate model (GSSM), outperforms state-of-the-art methods in predicting environment dynamics.
Our approach is able to obtain high returns, while allowing fast execution during deployment by avoiding test time policy gradient optimization.
arXiv Detail & Related papers (2021-02-16T17:21:55Z) - Importance Weighted Policy Learning and Adaptation [89.46467771037054]
We study a complementary approach which is conceptually simple, general, modular and built on top of recent improvements in off-policy learning.
The framework is inspired by ideas from the probabilistic inference literature and combines robust off-policy learning with a behavior prior.
Our approach achieves competitive adaptation performance on hold-out tasks compared to meta reinforcement learning baselines and can scale to complex sparse-reward scenarios.
arXiv Detail & Related papers (2020-09-10T14:16:58Z) - Inverse Reinforcement Learning from a Gradient-based Learner [41.8663538249537]
Inverse Reinforcement Learning addresses the problem of inferring an expert's reward function from demonstrations.
In this paper, we propose a new algorithm for this setting, in which the goal is to recover the reward function being optimized by an agent.
arXiv Detail & Related papers (2020-07-15T16:41:00Z) - Meta-Reinforcement Learning Robust to Distributional Shift via Model
Identification and Experience Relabeling [126.69933134648541]
We present a meta-reinforcement learning algorithm that is both efficient and extrapolates well when faced with out-of-distribution tasks at test time.
Our method is based on a simple insight: we recognize that dynamics models can be adapted efficiently and consistently with off-policy data.
arXiv Detail & Related papers (2020-06-12T13:34:46Z) - Learning Adaptive Exploration Strategies in Dynamic Environments Through
Informed Policy Regularization [100.72335252255989]
We study the problem of learning exploration-exploitation strategies that effectively adapt to dynamic environments.
We propose a novel algorithm that regularizes the training of an RNN-based policy using informed policies trained to maximize the reward in each task.
arXiv Detail & Related papers (2020-05-06T16:14:48Z) - How to Learn a Useful Critic? Model-based Action-Gradient-Estimator
Policy Optimization [10.424426548124696]
We propose MAGE, a model-based actor-critic algorithm, grounded in the theory of policy gradients.
MAGE backpropagates through the learned dynamics to compute gradient targets in temporal difference learning.
We demonstrate the efficiency of the algorithm in comparison to model-free and model-based state-of-the-art baselines.
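The key mechanism here is that the TD target becomes differentiable with respect to the action once a learned dynamics model is in the loop. A hypothetical 1-D illustration (toy functions of my own choosing, not the paper's code): the gradient of the target with respect to the action picks up a term that flows through the model, which a finite-difference check confirms.

```python
# Hypothetical 1-D illustration of differentiating a TD-style target
# through a learned dynamics model, in the spirit of MAGE.
GAMMA = 0.9

def dynamics(s, a):
    """Stand-in learned model f(s, a) (linear, for illustration)."""
    return 0.8 * s + 0.5 * a

def reward(s, a):
    return -(s ** 2) - 0.1 * (a ** 2)

def value(s):
    """Stand-in value function V(s)."""
    return -2.0 * s ** 2

def td_target(s, a):
    return reward(s, a) + GAMMA * value(dynamics(s, a))

def grad_a_td_target(s, a):
    """Analytic d/da of the target; the GAMMA * V' term flows through f."""
    ds_da = 0.5                      # df/da for the model above
    dr_da = -0.2 * a                 # dr/da
    dv_ds = -4.0 * dynamics(s, a)    # dV/ds at the predicted next state
    return dr_da + GAMMA * dv_ds * ds_da

# Finite-difference check that the model-based gradient is correct.
s, a, eps = 1.0, 0.3, 1e-6
fd = (td_target(s, a + eps) - td_target(s, a - eps)) / (2 * eps)
print(abs(fd - grad_a_td_target(s, a)) < 1e-5)   # → True
```

MAGE then uses such action-gradients of the temporal-difference error as training targets for the critic, rather than fitting values alone.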
arXiv Detail & Related papers (2020-04-29T16:30:53Z) - Online Meta-Critic Learning for Off-Policy Actor-Critic Methods [107.98781730288897]
Off-Policy Actor-Critic (Off-PAC) methods have proven successful in a variety of continuous control tasks.
We introduce a novel and flexible meta-critic that observes the learning process and meta-learns an additional loss for the actor.
arXiv Detail & Related papers (2020-03-11T14:39:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides (including all listed papers) and is not responsible for any consequences of its use.