Parameter-Based Value Functions
- URL: http://arxiv.org/abs/2006.09226v4
- Date: Fri, 13 Aug 2021 14:33:27 GMT
- Title: Parameter-Based Value Functions
- Authors: Francesco Faccio, Louis Kirsch and Jürgen Schmidhuber
- Abstract summary: Off-policy actor-critic Reinforcement Learning (RL) algorithms learn value functions of a single target policy.
We introduce a class of value functions called Parameter-Based Value Functions (PBVFs) whose inputs include the policy parameters.
We show how learned PBVFs can zero-shot learn new policies that outperform any policy seen during training.
- Score: 7.519872646378835
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional off-policy actor-critic Reinforcement Learning (RL) algorithms
learn value functions of a single target policy. However, when value functions
are updated to track the learned policy, they forget potentially useful
information about old policies. We introduce a class of value functions called
Parameter-Based Value Functions (PBVFs) whose inputs include the policy
parameters. They can generalize across different policies. PBVFs can evaluate
the performance of any policy given a state, a state-action pair, or a
distribution over the RL agent's initial states. First we show how PBVFs yield
novel off-policy policy gradient theorems. Then we derive off-policy
actor-critic algorithms based on PBVFs trained by Monte Carlo or Temporal
Difference methods. We show how learned PBVFs can zero-shot learn new policies
that outperform any policy seen during training. Finally our algorithms are
evaluated on a selection of discrete and continuous control tasks using shallow
policies and deep neural networks. Their performance is comparable to
state-of-the-art methods.
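
The following is a minimal, self-contained PyTorch sketch of the core idea described in the abstract: a parameter-based value function V(theta) takes the flattened policy parameters as input, is fit to Monte Carlo returns of randomly initialized policies, and is then used to search for a better policy purely by gradient ascent on the frozen critic, mirroring the zero-shot claim above. The toy environment, network sizes, and training schedule are illustrative assumptions and not the authors' setup; the state- and state-action-conditioned variants V(s, theta) and Q(s, a, theta) and the Temporal Difference training discussed in the paper are omitted here.

```python
# Sketch only: assumed toy task and hyperparameters, not the paper's implementation.
import torch
import torch.nn as nn

torch.manual_seed(0)
STATE_DIM, ACTION_DIM = 3, 1

def make_policy():
    # Small deterministic policy network pi_theta(s) -> a.
    return nn.Sequential(nn.Linear(STATE_DIM, 8), nn.Tanh(), nn.Linear(8, ACTION_DIM))

def flat_params(policy):
    # Concatenate all policy parameters into a single vector (the PBVF input).
    return torch.cat([p.reshape(-1) for p in policy.parameters()])

def monte_carlo_return(policy, episodes=8, horizon=10):
    # Average return on a toy task: the agent is rewarded for driving the state to zero.
    total = 0.0
    with torch.no_grad():
        for _ in range(episodes):
            s = torch.randn(STATE_DIM)
            for _ in range(horizon):
                a = policy(s)
                s = s + 0.1 * a.expand(STATE_DIM) + 0.01 * torch.randn(STATE_DIM)
                total += -s.pow(2).sum().item()
    return total / episodes

# The PBVF critic: flattened policy parameters -> predicted return.
n_params = flat_params(make_policy()).numel()
pbvf = nn.Sequential(nn.Linear(n_params, 64), nn.ReLU(), nn.Linear(64, 1))
opt_v = torch.optim.Adam(pbvf.parameters(), lr=1e-3)

# Fit the critic on (policy parameters, Monte Carlo return) pairs from random policies.
policies = [make_policy() for _ in range(200)]
thetas = torch.stack([flat_params(p).detach() for p in policies])
returns = torch.tensor([[monte_carlo_return(p)] for p in policies])
for _ in range(300):
    opt_v.zero_grad()
    loss = (pbvf(thetas) - returns).pow(2).mean()
    loss.backward()
    opt_v.step()

# Zero-shot policy search: ascend the frozen critic with respect to new policy parameters.
new_policy = make_policy()
opt_pi = torch.optim.Adam(new_policy.parameters(), lr=1e-2)
for _ in range(500):
    opt_pi.zero_grad()
    loss_pi = -pbvf(flat_params(new_policy)).sum()   # maximize the critic's prediction
    loss_pi.backward()
    opt_pi.step()

print("Monte Carlo return of searched policy:", monte_carlo_return(new_policy))
```

Because the critic never sees the new policy's trajectories during the search step, any improvement comes entirely from the critic generalizing across policy parameters, which is the property the abstract emphasizes.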
Related papers
- Confident Natural Policy Gradient for Local Planning in $q_π$-realizable Constrained MDPs [44.69257217086967]
The constrained Markov decision process (CMDP) framework emerges as an important reinforcement learning approach for imposing safety or other critical objectives.
In this paper, we address the learning problem given linear function approximation with $q_π$-realizability.
arXiv Detail & Related papers (2024-06-26T17:57:13Z) - Offline RL with No OOD Actions: In-Sample Learning via Implicit Value
Regularization [90.9780151608281]
In-sample learning (IQL) improves the policy by expectile regression using only data samples.
We make a key finding that the in-sample learning paradigm arises under the Implicit Value Regularization (IVR) framework.
We propose two practical algorithms, Sparse $Q$-learning (SQL) and Exponential $Q$-learning (EQL), which adopt the same value regularization used in existing works.
arXiv Detail & Related papers (2023-03-28T08:30:01Z) - Confidence-Conditioned Value Functions for Offline Reinforcement
Learning [86.59173545987984]
We propose a new form of Bellman backup that simultaneously learns Q-values for any degree of confidence with high probability.
We theoretically show that our learned value functions produce conservative estimates of the true value at any desired confidence.
arXiv Detail & Related papers (2022-12-08T23:56:47Z) - General Policy Evaluation and Improvement by Learning to Identify Few
But Crucial States [12.059140532198064]
Learning to evaluate and improve policies is a core problem of Reinforcement Learning.
A recently explored competitive alternative is to learn a single value function for many policies.
We show that our value function trained to evaluate NN policies is also invariant to changes of the policy architecture.
arXiv Detail & Related papers (2022-07-04T16:34:53Z) - Neural Network Compatible Off-Policy Natural Actor-Critic Algorithm [16.115903198836694]
Learning optimal behavior from existing data is one of the most important problems in Reinforcement Learning (RL).
This is known as "off-policy control" in RL, where an agent's objective is to compute an optimal policy based on the data obtained from the given policy (known as the behavior policy).
This work proposes an off-policy natural actor-critic algorithm that utilizes state-action distribution correction for handling the off-policy behavior and the natural policy gradient for sample efficiency.
arXiv Detail & Related papers (2021-10-19T14:36:45Z) - Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z) - Supervised Off-Policy Ranking [145.3039527243585]
Off-policy evaluation (OPE) leverages data generated by other policies to evaluate a target policy.
We propose supervised off-policy ranking that learns a policy scoring model by correctly ranking training policies with known performance.
Our method outperforms strong baseline OPE methods in terms of both rank correlation and performance gap between the truly best and the best of the ranked top three policies.
arXiv Detail & Related papers (2021-07-03T07:01:23Z) - What About Inputing Policy in Value Function: Policy Representation and
Policy-extended Value Function Approximator [39.287998861631]
We study the Policy-extended Value Function Approximator (PeVFA) in Reinforcement Learning (RL).
We show that generalized value estimates offered by PeVFA may have lower initial approximation error to true values of successive policies.
We propose a representation learning framework for RL policy, providing several approaches to learn effective policy embeddings from policy network parameters or state-action pairs.
arXiv Detail & Related papers (2020-10-19T14:09:18Z) - Policy Evaluation Networks [50.53250641051648]
We introduce a scalable, differentiable fingerprinting mechanism that retains essential policy information in a concise embedding.
Our empirical results demonstrate that combining these three elements can produce policies that outperform those that generated the training data.
arXiv Detail & Related papers (2020-02-26T23:00:27Z) - BRPO: Batch Residual Policy Optimization [79.53696635382592]
In batch reinforcement learning, one often constrains a learned policy to be close to the behavior (data-generating) policy.
We propose residual policies, where the allowable deviation of the learned policy is state-action-dependent.
We derive a new RL method, BRPO, which learns both the policy and the allowable deviation that jointly maximize a lower bound on policy performance.
arXiv Detail & Related papers (2020-02-08T01:59:33Z)
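
Several entries above (Policy Evaluation Networks, PeVFA) condition a value function on a compact policy representation rather than on raw parameters. As a rough illustration of that alternative, here is a hypothetical fingerprint-style embedding in PyTorch: the policy is probed on a fixed set of states and its concatenated outputs form an architecture-agnostic input to a value head. The probe states, network sizes, and the use of random rather than learned probes are assumptions for illustration only, not the methods of those papers.

```python
# Sketch only: fingerprint-style policy embedding under assumed probe states and sizes.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, NUM_PROBES = 3, 1, 16

probe_states = torch.randn(NUM_PROBES, STATE_DIM)   # fixed probing states (assumed random here)

def fingerprint(policy):
    # Embed a policy by its behaviour on the probe states (differentiable w.r.t. the policy).
    return policy(probe_states).reshape(-1)          # shape: NUM_PROBES * ACTION_DIM

# Value head that predicts a policy's return from its fingerprint.
value_head = nn.Sequential(
    nn.Linear(NUM_PROBES * ACTION_DIM, 64), nn.ReLU(), nn.Linear(64, 1)
)

policy = nn.Sequential(nn.Linear(STATE_DIM, 8), nn.Tanh(), nn.Linear(8, ACTION_DIM))
print(value_head(fingerprint(policy)))               # scalar return estimate for this policy
```

Compared with feeding raw parameter vectors (as in the PBVF sketch earlier), a behavioural fingerprint of this kind stays the same size regardless of the policy architecture, which is the motivation those entries describe.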
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the information above (including all generated summaries) and is not responsible for any consequences of its use.