Understanding the Pathologies of Approximate Policy Evaluation when
Combined with Greedification in Reinforcement Learning
- URL: http://arxiv.org/abs/2010.15268v1
- Date: Wed, 28 Oct 2020 22:57:57 GMT
- Title: Understanding the Pathologies of Approximate Policy Evaluation when
Combined with Greedification in Reinforcement Learning
- Authors: Kenny Young and Richard S. Sutton
- Abstract summary: The theory of reinforcement learning with value function approximation remains fundamentally incomplete.
Prior work has identified a variety of pathological behaviours that arise in RL algorithms that combine approximate on-policy evaluation and greedification.
We present examples illustrating that, in addition to policy oscillation and multiple fixed points, the same basic issue can lead to convergence to the worst possible policy for a given approximation.
- Score: 11.295757620340899
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite empirical success, the theory of reinforcement learning (RL) with
value function approximation remains fundamentally incomplete. Prior work has
identified a variety of pathological behaviours that arise in RL algorithms
that combine approximate on-policy evaluation and greedification. One prominent
example is policy oscillation, wherein an algorithm may cycle indefinitely
between policies, rather than converging to a fixed point. What is not well
understood however is the quality of the policies in the region of oscillation.
In this paper we present simple examples illustrating that, in addition to
policy oscillation and multiple fixed points, the same basic issue can lead
to convergence to the worst possible policy for a given approximation. Such
behaviours can arise when algorithms optimize evaluation accuracy weighted by
the distribution of states that occur under the current policy, but greedify
based on the value of states which are rare or nonexistent under this
distribution. This means the values used for greedification are unreliable and
can steer the policy in undesirable directions. Our observation that this can
lead to the worst possible policy shows that in a general sense such algorithms
are unreliable. The existence of such examples helps to narrow the kind of
theoretical guarantees that are possible and the kind of algorithmic ideas that
are likely to be helpful. We demonstrate analytically and experimentally that
such pathological behaviours can impact a wide range of RL and dynamic
programming algorithms; such behaviours can arise both with and without
bootstrapping, and with linear function approximation as well as with more
complex parameterized functions like neural networks.
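To make the mechanism concrete, here is a minimal numerical sketch. It is a hypothetical toy construction, not one of the paper's examples: a three-state MDP with linear function approximation in which two states share a feature, evaluation fits the true values weighted by the on-policy state distribution, and greedification then relies on the aliased, unreliable estimate for a state the current policy rarely visits. The transition structure, rewards, discount factor, and features are all assumptions chosen for illustration; running the sketch shows the greedy policy oscillating rather than settling on the better policy.
```python
import numpy as np

# Toy sketch (not the paper's construction) of approximate policy iteration:
# evaluation is weighted by the on-policy state distribution, but greedification
# uses the value of a state that is rare under that distribution.
# States: 0 = A, 1 = B, 2 = C. Two actions per state.
# A: action 0 -> C with reward +1 (good loop); action 1 -> B with reward 0 (detour).
# B: either action -> C with reward -2 (bad state).
# C: either action -> A with reward 0.
n_states, n_actions, gamma = 3, 2, 0.9

P = np.zeros((n_states, n_actions, n_states))
R = np.zeros((n_states, n_actions))
P[0, 0, 2], R[0, 0] = 1.0, 1.0   # A --a0--> C, +1
P[0, 1, 1], R[0, 1] = 1.0, 0.0   # A --a1--> B,  0
P[1, :, 2], R[1, :] = 1.0, -2.0  # B --* --> C, -2
P[2, :, 0], R[2, :] = 1.0, 0.0   # C --* --> A,  0

# Aliased features: A and B share a single feature, C is fixed at value 0.
Phi = np.array([[1.0], [1.0], [0.0]])

def policy_matrices(pi):
    P_pi = np.array([P[s, pi[s]] for s in range(n_states)])
    r_pi = np.array([R[s, pi[s]] for s in range(n_states)])
    return P_pi, r_pi

def true_values(pi):
    # Exact policy evaluation: solve v = r_pi + gamma * P_pi v.
    P_pi, r_pi = policy_matrices(pi)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

def stationary_distribution(pi):
    # Solve d P_pi = d with sum(d) = 1 as a least-squares system.
    P_pi, _ = policy_matrices(pi)
    M = np.vstack([P_pi.T - np.eye(n_states), np.ones(n_states)])
    b = np.append(np.zeros(n_states), 1.0)
    d, *_ = np.linalg.lstsq(M, b, rcond=None)
    return np.clip(d, 0.0, None)

def approx_values(pi):
    # Best linear fit to the true values, weighted by the on-policy distribution.
    d, v = stationary_distribution(pi), true_values(pi)
    D = np.diag(d)
    w = np.linalg.solve(Phi.T @ D @ Phi + 1e-9 * np.eye(Phi.shape[1]),
                        Phi.T @ D @ v)
    return Phi @ w

def greedify(v_hat):
    # One-step greedification with respect to the approximate values.
    q = R + gamma * np.einsum('sat,t->sa', P, v_hat)
    return np.argmax(q, axis=1)

pi = np.zeros(n_states, dtype=int)  # start from the good policy at A
for k in range(8):
    v_hat = approx_values(pi)
    pi = greedify(v_hat)
    print(f"iter {k}: action at A = {pi[0]}, v_hat = {np.round(v_hat, 2)}")
# The action at A oscillates: B's value estimate is aliased with A's, so
# greedification is steered by an unreliable value for a rarely visited state.
```
Under the starting policy, B is never visited, so its estimated value inherits A's large value and greedification detours through the bad state; under the detouring policy the shared weight turns negative and greedification switches back, producing the oscillation the abstract describes.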
Related papers
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, the (hyper)policies learned at convergence are deployed only in their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) instead of point estimates (a generic sketch of this pessimism principle appears after this list).
arXiv Detail & Related papers (2022-12-19T22:43:08Z) - Chaining Value Functions for Off-Policy Learning [22.54793586116019]
We discuss a novel family of off-policy prediction algorithms which are convergent by construction.
We prove that the proposed scheme is convergent and corresponds to an iterative decomposition of the inverse key matrix.
Empirically we evaluate the idea on challenging MDPs such as Baird's counterexample and observe favourable results.
arXiv Detail & Related papers (2022-01-17T15:26:47Z) - Neural Network Compatible Off-Policy Natural Actor-Critic Algorithm [16.115903198836694]
Learning optimal behavior from existing data is one of the most important problems in Reinforcement Learning (RL).
This is known as "off-policy control" in RL, where an agent's objective is to compute an optimal policy based on the data obtained from the given policy (known as the behavior policy).
This work proposes an off-policy natural actor-critic algorithm that utilizes state-action distribution correction for handling the off-policy behavior and the natural policy gradient for sample efficiency.
arXiv Detail & Related papers (2021-10-19T14:36:45Z) - The Role of Lookahead and Approximate Policy Evaluation in Policy
Iteration with Linear Value Function Approximation [14.528756508275622]
We show that when linear function approximation is used to represent the value function, a certain minimum amount of lookahead and multi-step returns is needed.
When this condition is met, we characterize the finite-time performance of policies obtained using such approximate policy iteration.
arXiv Detail & Related papers (2021-09-28T01:20:08Z) - Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation in average-reward MDPs.
Bootstrapping is necessary and, together with off-policy learning and function approximation, results in the deadly triad.
We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting.
arXiv Detail & Related papers (2021-01-08T00:43:04Z) - Policy Gradient for Continuing Tasks in Non-stationary Markov Decision
Processes [112.38662246621969]
Reinforcement learning considers the problem of finding policies that maximize an expected cumulative reward in a Markov decision process with unknown transition probabilities.
A major drawback of policy gradient-type algorithms is that they are limited to episodic tasks unless stationarity assumptions are imposed.
We compute unbiased estimates of the gradient of the value function, which we use as ascent directions to update the policy.
arXiv Detail & Related papers (2020-10-16T15:15:42Z) - Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z) - Confounding-Robust Policy Evaluation in Infinite-Horizon Reinforcement
Learning [70.01650994156797]
Off-policy evaluation of sequential decision policies from observational data is necessary in batch reinforcement learning settings such as education and healthcare.
We develop an approach that estimates bounds on the value of a given policy.
We prove convergence to the sharp bounds as we collect more confounded data.
arXiv Detail & Related papers (2020-02-11T16:18:14Z) - A Nonparametric Off-Policy Policy Gradient [32.35604597324448]
Reinforcement learning (RL) algorithms still suffer from high sample complexity despite outstanding recent successes.
We build on the general sample efficiency of off-policy algorithms.
We show that our approach has better sample efficiency than state-of-the-art policy gradient methods.
arXiv Detail & Related papers (2020-01-08T10:13:08Z)
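The pessimism principle mentioned in the "Policy learning 'without' overlap" entry above can be illustrated generically. The following is a sketch of choosing by lower confidence bound rather than by point estimate, with made-up estimates, counts, and a Hoeffding-style width; it is not that paper's PPL algorithm or analysis.
```python
import numpy as np

# Generic illustration of pessimism via lower confidence bounds (LCBs):
# prefer the action whose estimated value minus an uncertainty width is
# largest, rather than the action with the largest point estimate.
# All numbers below are assumptions chosen for illustration.
value_estimates = np.array([0.60, 0.55, 0.30])  # point estimates per action
counts = np.array([5, 500, 200])                # data supporting each estimate
width = np.sqrt(np.log(10.0) / np.maximum(counts, 1))  # Hoeffding-style half-width

greedy_action = int(np.argmax(value_estimates))                # trusts the point estimate
pessimistic_action = int(np.argmax(value_estimates - width))   # trusts the LCB

print("greedy choice:     ", greedy_action)       # 0: high estimate, little data
print("pessimistic choice:", pessimistic_action)  # 1: slightly lower estimate, well supported
```
The poorly supported estimate wins under the greedy rule but not under the LCB rule, which is the kind of robustness such pessimistic methods aim for when overlap is weak.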