General Policy Evaluation and Improvement by Learning to Identify Few
But Crucial States
- URL: http://arxiv.org/abs/2207.01566v1
- Date: Mon, 4 Jul 2022 16:34:53 GMT
- Title: General Policy Evaluation and Improvement by Learning to Identify Few
But Crucial States
- Authors: Francesco Faccio, Aditya Ramesh, Vincent Herrmann, Jean Harb, Jürgen Schmidhuber
- Abstract summary: Learning to evaluate and improve policies is a core problem of Reinforcement Learning.
A recently explored competitive alternative is to learn a single value function for many policies.
We show that our value function trained to evaluate NN policies is also invariant to changes of the policy architecture.
- Score: 12.059140532198064
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning to evaluate and improve policies is a core problem of Reinforcement
Learning (RL). Traditional RL algorithms learn a value function defined for a
single policy. A recently explored competitive alternative is to learn a single
value function for many policies. Here we combine the actor-critic architecture
of Parameter-Based Value Functions and the policy embedding of Policy
Evaluation Networks to learn a single value function for evaluating (and thus
helping to improve) any policy represented by a deep neural network (NN). The
method yields competitive experimental results. In continuous control problems
with infinitely many states, our value function minimizes its prediction error
by simultaneously learning a small set of `probing states' and a mapping from
actions produced in probing states to the policy's return. The method extracts
crucial abstract knowledge about the environment in the form of very few states
sufficient to fully specify the behavior of many policies. A policy improves
solely by changing actions in probing states, following the gradient of the
value function's predictions. Surprisingly, it is possible to clone the
behavior of a near-optimal policy in Swimmer-v3 and Hopper-v3 environments only
by knowing how to act in 3 and 5 such learned states, respectively. Remarkably,
our value function trained to evaluate NN policies is also invariant to changes
of the policy architecture: we show that it allows for zero-shot learning of
linear policies competitive with the best policy seen during training. Our code
is public.
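The abstract is concrete enough to illustrate with code. Below is a minimal PyTorch sketch of the probing-state idea as described there, assuming a simple regression setup over training policies with observed returns; the class and function names, network sizes, and optimizer handling are illustrative assumptions, not the authors' released implementation.

```python
# Sketch (assumptions, not the paper's code): a value function that jointly
# learns a few "probing states" and a mapping from the actions a policy
# produces in those states to the policy's return.
import torch
import torch.nn as nn


class ProbingStateValueFunction(nn.Module):
    def __init__(self, state_dim, action_dim, num_probing_states=5, hidden=256):
        super().__init__()
        # Probing states are free parameters, learned alongside the predictor
        # by minimizing the return-prediction error.
        self.probing_states = nn.Parameter(torch.randn(num_probing_states, state_dim))
        # Maps the concatenated probing actions to a scalar return estimate.
        self.predictor = nn.Sequential(
            nn.Linear(num_probing_states * action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, policy):
        # `policy` is any differentiable state -> action module; only its
        # behavior in the probing states matters, not its architecture.
        probing_actions = policy(self.probing_states)           # (K, action_dim)
        return self.predictor(probing_actions.reshape(1, -1))   # (1, 1)


def evaluation_step(value_fn, policy, observed_return, value_optimizer):
    """Regress the predicted return of a fixed training policy onto its
    observed return; `value_optimizer` holds only value_fn.parameters()."""
    loss = (value_fn(policy) - observed_return) ** 2
    value_optimizer.zero_grad()
    loss.backward()
    value_optimizer.step()
    return loss.item()


def improvement_step(value_fn, policy, policy_optimizer):
    """Improve the policy by ascending the value function's prediction,
    i.e. by changing only what the policy does in the probing states."""
    predicted_return = value_fn(policy)
    policy_optimizer.zero_grad()
    (-predicted_return).backward()   # gradient ascent on the predicted return
    policy_optimizer.step()
    return predicted_return.item()
```

Because the predictor only ever sees the actions taken in the learned probing states, the same value function can in principle score and improve policies of any architecture, which is one way to read the abstract's zero-shot result for linear policies.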
Related papers
- Neural Network Compatible Off-Policy Natural Actor-Critic Algorithm [16.115903198836694]
Learning optimal behavior from existing data is one of the most important problems in Reinforcement Learning (RL).
This is known as "off-policy control": the agent's objective is to compute an optimal policy from data generated by a different, given policy (the behavior policy).
This work proposes an off-policy natural actor-critic algorithm that uses a state-action distribution correction to handle the off-policy data and the natural policy gradient for sample efficiency.
arXiv Detail & Related papers (2021-10-19T14:36:45Z)
- Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z)
- Supervised Off-Policy Ranking [145.3039527243585]
Off-policy evaluation (OPE) leverages data generated by other policies to evaluate a target policy.
We propose supervised off-policy ranking that learns a policy scoring model by correctly ranking training policies with known performance.
Our method outperforms strong baseline OPE methods in both rank correlation and the performance gap between the truly best policy and the best of the top three ranked policies.
arXiv Detail & Related papers (2021-07-03T07:01:23Z)
- Parameter-Based Value Functions [7.519872646378835]
Off-policy actor-critic Reinforcement Learning (RL) algorithms learn value functions of a single target policy.
We introduce a class of value functions called Parameter-Based Value Functions (PBVFs), whose inputs include the policy parameters (a hedged sketch of this parameter-as-input idea appears after this list).
We show how learned PBVFs can zero-shot learn new policies that outperform any policy seen during training.
arXiv Detail & Related papers (2020-06-16T15:04:49Z)
- Zeroth-Order Supervised Policy Improvement [94.0748002906652]
Policy gradient (PG) algorithms have been widely used in reinforcement learning (RL).
We propose Zeroth-Order Supervised Policy Improvement (ZOSPI).
ZOSPI exploits the estimated value function $Q$ globally while preserving the local exploitation of the PG methods.
arXiv Detail & Related papers (2020-06-11T16:49:23Z)
- Policy Evaluation Networks [50.53250641051648]
We introduce a scalable, differentiable fingerprinting mechanism that retains essential policy information in a concise embedding.
Our empirical results demonstrate that combining these three elements can produce policies that outperform those that generated the training data.
arXiv Detail & Related papers (2020-02-26T23:00:27Z)
- Kalman meets Bellman: Improving Policy Evaluation through Value Tracking [59.691919635037216]
Policy evaluation is a key process in Reinforcement Learning (RL).
We devise an optimization method called Kalman Optimization for Value Approximation (KOVA).
KOVA minimizes a regularized objective function that concerns both parameter and noisy return uncertainties.
arXiv Detail & Related papers (2020-02-17T13:30:43Z)
- BRPO: Batch Residual Policy Optimization [79.53696635382592]
In batch reinforcement learning, one often constrains a learned policy to be close to the behavior (data-generating) policy.
We propose residual policies, where the allowable deviation of the learned policy is state-action-dependent.
We derive a new RL method, BRPO, which learns both the policy and the allowable deviation that jointly maximize a lower bound on policy performance.
arXiv Detail & Related papers (2020-02-08T01:59:33Z)
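For the Parameter-Based Value Functions entry above, the defining feature is that the policy parameters themselves are the value function's input, which is what the main paper's actor-critic architecture builds on. The sketch below illustrates that input convention and the resulting zero-shot improvement by gradient ascent on the parameters; the names, network sizes, and flattening convention are assumptions for illustration, not the PBVF authors' implementation.

```python
# Sketch (assumptions, not the PBVF paper's code): a value function V(theta)
# over flattened policy parameters, used to improve a policy without
# environment interaction by gradient ascent on theta.
import torch
import torch.nn as nn
from torch.nn.utils import parameters_to_vector, vector_to_parameters


class ParameterBasedValueFunction(nn.Module):
    def __init__(self, num_policy_params, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_policy_params, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, flat_policy_params):
        return self.net(flat_policy_params)


def zero_shot_improve(value_fn, policy, steps=100, lr=1e-2):
    """Improve `policy` by ascending V(theta) on its flattened parameters."""
    theta = parameters_to_vector(policy.parameters()).detach().requires_grad_(True)
    opt = torch.optim.Adam([theta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        (-value_fn(theta)).backward()   # gradient ascent on the predicted return
        opt.step()
    vector_to_parameters(theta.detach(), policy.parameters())
    return policy


# Usage sketch: the value function is built for a fixed parameter count.
policy = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 2))
n_params = sum(p.numel() for p in policy.parameters())
value_fn = ParameterBasedValueFunction(num_policy_params=n_params)
```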