Parameter-Based Value Functions
- URL: http://arxiv.org/abs/2006.09226v4
- Date: Fri, 13 Aug 2021 14:33:27 GMT
- Title: Parameter-Based Value Functions
- Authors: Francesco Faccio, Louis Kirsch and Jürgen Schmidhuber
- Abstract summary: Off-policy actor-critic Reinforcement Learning (RL) algorithms learn value functions of a single target policy.
We introduce a class of value functions called Parameter-Based Value Functions (PBVFs) whose inputs include the policy parameters.
We show how learned PBVFs can zero-shot learn new policies that outperform any policy seen during training.
- Score: 7.519872646378835
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional off-policy actor-critic Reinforcement Learning (RL) algorithms
learn value functions of a single target policy. However, when value functions
are updated to track the learned policy, they forget potentially useful
information about old policies. We introduce a class of value functions called
Parameter-Based Value Functions (PBVFs) whose inputs include the policy
parameters. They can generalize across different policies. PBVFs can evaluate
the performance of any policy given a state, a state-action pair, or a
distribution over the RL agent's initial states. First we show how PBVFs yield
novel off-policy policy gradient theorems. Then we derive off-policy
actor-critic algorithms based on PBVFs trained by Monte Carlo or Temporal
Difference methods. We show how learned PBVFs can zero-shot learn new policies
that outperform any policy seen during training. Finally our algorithms are
evaluated on a selection of discrete and continuous control tasks using shallow
policies and deep neural networks. Their performance is comparable to
state-of-the-art methods.
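
The following is a minimal, self-contained PyTorch sketch of the core idea described in the abstract: a parameter-based value function V(theta) takes the flattened policy parameters as input, is fit to Monte Carlo returns of randomly initialized policies, and is then used to search for a better policy purely by gradient ascent on the frozen critic, mirroring the zero-shot claim above. The toy environment, network sizes, and training schedule are illustrative assumptions and not the authors' setup; the state- and state-action-conditioned variants V(s, theta) and Q(s, a, theta) and the Temporal Difference training discussed in the paper are omitted here.

```python
# Sketch only: assumed toy task and hyperparameters, not the paper's implementation.
import torch
import torch.nn as nn

torch.manual_seed(0)
STATE_DIM, ACTION_DIM = 3, 1

def make_policy():
    # Small deterministic policy network pi_theta(s) -> a.
    return nn.Sequential(nn.Linear(STATE_DIM, 8), nn.Tanh(), nn.Linear(8, ACTION_DIM))

def flat_params(policy):
    # Concatenate all policy parameters into a single vector (the PBVF input).
    return torch.cat([p.reshape(-1) for p in policy.parameters()])

def monte_carlo_return(policy, episodes=8, horizon=10):
    # Average return on a toy task: the agent is rewarded for driving the state to zero.
    total = 0.0
    with torch.no_grad():
        for _ in range(episodes):
            s = torch.randn(STATE_DIM)
            for _ in range(horizon):
                a = policy(s)
                s = s + 0.1 * a.expand(STATE_DIM) + 0.01 * torch.randn(STATE_DIM)
                total += -s.pow(2).sum().item()
    return total / episodes

# The PBVF critic: flattened policy parameters -> predicted return.
n_params = flat_params(make_policy()).numel()
pbvf = nn.Sequential(nn.Linear(n_params, 64), nn.ReLU(), nn.Linear(64, 1))
opt_v = torch.optim.Adam(pbvf.parameters(), lr=1e-3)

# Fit the critic on (policy parameters, Monte Carlo return) pairs from random policies.
policies = [make_policy() for _ in range(200)]
thetas = torch.stack([flat_params(p).detach() for p in policies])
returns = torch.tensor([[monte_carlo_return(p)] for p in policies])
for _ in range(300):
    opt_v.zero_grad()
    loss = (pbvf(thetas) - returns).pow(2).mean()
    loss.backward()
    opt_v.step()

# Zero-shot policy search: ascend the frozen critic with respect to new policy parameters.
new_policy = make_policy()
opt_pi = torch.optim.Adam(new_policy.parameters(), lr=1e-2)
for _ in range(500):
    opt_pi.zero_grad()
    loss_pi = -pbvf(flat_params(new_policy)).sum()   # maximize the critic's prediction
    loss_pi.backward()
    opt_pi.step()

print("Monte Carlo return of searched policy:", monte_carlo_return(new_policy))
```

Because the critic never sees the new policy's trajectories during the search step, any improvement comes entirely from the critic generalizing across policy parameters, which is the property the abstract emphasizes.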
Related papers
- Confident Natural Policy Gradient for Local Planning in $q_π$-realizable Constrained MDPs [44.69257217086967]
The constrained Markov decision process (CMDP) framework emerges as an important reinforcement learning approach for imposing safety or other critical objectives.
In this paper, we address the learning problem given linear function approximation with $q_π$-realizability.
arXiv Detail & Related papers (2024-06-26T17:57:13Z) - Offline RL with No OOD Actions: In-Sample Learning via Implicit Value
Regularization [90.9780151608281]
In-sample learning (IQL) improves the policy by expectile regression using only data samples.
We make a key finding that the in-sample learning paradigm arises under the Implicit Value Regularization (IVR) framework.
We propose two practical algorithms, Sparse $Q$-learning (SQL) and Exponential $Q$-learning (EQL), which adopt the same value regularization used in existing works.
arXiv Detail & Related papers (2023-03-28T08:30:01Z) - Confidence-Conditioned Value Functions for Offline Reinforcement
Learning [86.59173545987984]
We propose a new form of Bellman backup that simultaneously learns Q-values for any degree of confidence with high probability.
We theoretically show that our learned value functions produce conservative estimates of the true value at any desired confidence.
arXiv Detail & Related papers (2022-12-08T23:56:47Z) - General Policy Evaluation and Improvement by Learning to Identify Few
But Crucial States [12.059140532198064]
Learning to evaluate and improve policies is a core problem of Reinforcement Learning.
A recently explored competitive alternative is to learn a single value function for many policies.
We show that our value function trained to evaluate NN policies is also invariant to changes of the policy architecture.
arXiv Detail & Related papers (2022-07-04T16:34:53Z) - Neural Network Compatible Off-Policy Natural Actor-Critic Algorithm [16.115903198836694]
Learning optimal behavior from existing data is one of the most important problems in Reinforcement Learning (RL).
This is known as "off-policy control" in RL, where an agent's objective is to compute an optimal policy based on the data obtained from the given policy (known as the behavior policy).
This work proposes an off-policy natural actor-critic algorithm that utilizes state-action distribution correction for handling the off-policy behavior and the natural policy gradient for sample efficiency.
arXiv Detail & Related papers (2021-10-19T14:36:45Z) - Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z) - Supervised Off-Policy Ranking [145.3039527243585]
Off-policy evaluation (OPE) leverages data generated by other policies to evaluate a target policy.
We propose supervised off-policy ranking that learns a policy scoring model by correctly ranking training policies with known performance.
Our method outperforms strong baseline OPE methods in terms of both rank correlation and performance gap between the truly best and the best of the ranked top three policies.
arXiv Detail & Related papers (2021-07-03T07:01:23Z) - What About Inputing Policy in Value Function: Policy Representation and
Policy-extended Value Function Approximator [39.287998861631]
We study the Policy-extended Value Function Approximator (PeVFA) in Reinforcement Learning (RL).
We show that generalized value estimates offered by PeVFA may have lower initial approximation error to true values of successive policies.
We propose a representation learning framework for RL policy, providing several approaches to learn effective policy embeddings from policy network parameters or state-action pairs.
arXiv Detail & Related papers (2020-10-19T14:09:18Z) - Policy Evaluation Networks [50.53250641051648]
We introduce a scalable, differentiable fingerprinting mechanism that retains essential policy information in a concise embedding.
Our empirical results demonstrate that combining these three elements can produce policies that outperform those that generated the training data.
arXiv Detail & Related papers (2020-02-26T23:00:27Z) - BRPO: Batch Residual Policy Optimization [79.53696635382592]
In batch reinforcement learning, one often constrains a learned policy to be close to the behavior (data-generating) policy.
We propose residual policies, where the allowable deviation of the learned policy is state-action-dependent.
We derive a new RL method, BRPO, which learns both the policy and the allowable deviation that jointly maximize a lower bound on policy performance.
arXiv Detail & Related papers (2020-02-08T01:59:33Z)
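
Several entries above (Policy Evaluation Networks, PeVFA) condition a value function on a compact policy representation rather than on raw parameters. As a rough illustration of that alternative, here is a hypothetical fingerprint-style embedding in PyTorch: the policy is probed on a fixed set of states and its concatenated outputs form an architecture-agnostic input to a value head. The probe states, network sizes, and the use of random rather than learned probes are assumptions for illustration only, not the methods of those papers.

```python
# Sketch only: fingerprint-style policy embedding under assumed probe states and sizes.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, NUM_PROBES = 3, 1, 16

probe_states = torch.randn(NUM_PROBES, STATE_DIM)   # fixed probing states (assumed random here)

def fingerprint(policy):
    # Embed a policy by its behaviour on the probe states (differentiable w.r.t. the policy).
    return policy(probe_states).reshape(-1)          # shape: NUM_PROBES * ACTION_DIM

# Value head that predicts a policy's return from its fingerprint.
value_head = nn.Sequential(
    nn.Linear(NUM_PROBES * ACTION_DIM, 64), nn.ReLU(), nn.Linear(64, 1)
)

policy = nn.Sequential(nn.Linear(STATE_DIM, 8), nn.Tanh(), nn.Linear(8, ACTION_DIM))
print(value_head(fingerprint(policy)))               # scalar return estimate for this policy
```

Compared with feeding raw parameter vectors (as in the PBVF sketch earlier), a behavioural fingerprint of this kind stays the same size regardless of the policy architecture, which is the motivation those entries describe.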
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the information above (including all generated summaries) and is not responsible for any consequences of its use.