Inverse Policy Evaluation for Value-based Sequential Decision-making
- URL: http://arxiv.org/abs/2008.11329v1
- Date: Wed, 26 Aug 2020 01:31:38 GMT
- Title: Inverse Policy Evaluation for Value-based Sequential Decision-making
- Authors: Alan Chan, Kris de Asis, Richard S. Sutton
- Abstract summary: Value-based methods for reinforcement learning lack generally applicable ways to derive behavior from a value function.
We show that inverse policy evaluation, combined with an approximate value iteration algorithm, is a feasible method for value-based control.
- Score: 10.188967035477217
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Value-based methods for reinforcement learning lack generally applicable ways
to derive behavior from a value function. Many approaches involve approximate
value iteration (e.g., $Q$-learning) and acting greedily with respect to the
estimates, with some degree of entropy added to ensure that the state space is
sufficiently explored. Behavior based on explicit greedification assumes that
the values reflect those of \textit{some} policy, over which the greedy policy
will be an improvement. However, value iteration can produce value functions
that do not correspond to \textit{any} policy. This is especially relevant in
the function-approximation regime, when the true value function cannot be
perfectly represented. In this work, we explore the use of \textit{inverse
policy evaluation}, the process of solving for a likely policy given a value
function, for deriving behavior from a value function. We provide theoretical
and empirical results to show that inverse policy evaluation, combined with an
approximate value iteration algorithm, is a feasible method for value-based
control.
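For intuition, here is a minimal sketch of the idea described in the abstract, not the authors' algorithm: given a tabular MDP and a value estimate $v$ that need not correspond to any policy, it performs gradient descent on a softmax-parameterized policy so that the policy's one-step Bellman backup matches $v$ at each state as closely as possible. The MDP arrays `P` and `R`, the function name, and all hyperparameters are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def inverse_policy_evaluation(v, P, R, gamma=0.9, lr=0.1, iters=5000):
    """Search for a stochastic policy whose one-step Bellman backup matches
    a given value estimate v at every state:
        v(s) ~= sum_a pi(a|s) * (R[s, a] + gamma * sum_s' P[s, a, s'] * v(s')).
    v need not be the value function of any policy, so an exact match may be
    impossible; the per-state residual is minimized in a least-squares sense.
    """
    S, A = R.shape
    q_v = R + gamma * (P @ v)                  # one-step lookahead values, shape (S, A)
    logits = np.zeros((S, A))                  # softmax parameters keep pi on the simplex
    for _ in range(iters):
        pi = np.exp(logits - logits.max(axis=1, keepdims=True))
        pi /= pi.sum(axis=1, keepdims=True)
        backup = (pi * q_v).sum(axis=1)        # sum_a pi(a|s) q_v(s, a), shape (S,)
        residual = backup - v
        # Gradient of 0.5 * residual**2 with respect to the softmax logits.
        grad = residual[:, None] * pi * (q_v - backup[:, None])
        logits -= lr * grad
    return pi

# Tiny random MDP, purely for illustration.
rng = np.random.default_rng(0)
S, A = 4, 2
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)   # transition probabilities
R = rng.random((S, A))                                         # expected rewards
v_est = rng.random(S)                                          # an arbitrary value estimate
print(inverse_policy_evaluation(v_est, P, R))                  # rows: per-state action distributions
```

The softmax parameterization is just one convenient way to keep each $\pi(\cdot|s)$ on the probability simplex without an explicit projection step; any constrained solver for the per-state residual would serve the same illustrative purpose.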
Related papers
- Stable Offline Value Function Learning with Bisimulation-based Representations [13.013000247825248]
In reinforcement learning, offline value function learning is used to estimate the expected discounted return from each state when taking actions according to a fixed target policy.
It is critical to stabilize value function learning by explicitly shaping the state-action representations.
We introduce a bisimulation-based algorithm, kernel representations for offline policy evaluation (KROPE).
arXiv Detail & Related papers (2024-10-02T15:13:25Z)
- Confident Natural Policy Gradient for Local Planning in $q_\pi$-realizable Constrained MDPs [44.69257217086967]
The constrained Markov decision process (CMDP) framework emerges as an important reinforcement learning approach for imposing safety or other critical objectives.
In this paper, we address the learning problem given linear function approximation with $q_\pi$-realizability.
arXiv Detail & Related papers (2024-06-26T17:57:13Z)
- Confidence-Conditioned Value Functions for Offline Reinforcement Learning [86.59173545987984]
We propose a new form of Bellman backup that simultaneously learns Q-values for any degree of confidence with high probability.
We theoretically show that our learned value functions produce conservative estimates of the true value at any desired confidence.
arXiv Detail & Related papers (2022-12-08T23:56:47Z)
- General Policy Evaluation and Improvement by Learning to Identify Few But Crucial States [12.059140532198064]
Learning to evaluate and improve policies is a core problem of Reinforcement Learning.
A recently explored competitive alternative is to learn a single value function for many policies.
We show that our value function trained to evaluate NN policies is also invariant to changes of the policy architecture.
arXiv Detail & Related papers (2022-07-04T16:34:53Z)
- Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z)
- Provable Benefits of Actor-Critic Methods for Offline Reinforcement Learning [85.50033812217254]
Actor-critic methods are widely used in offline reinforcement learning practice, but are not so well-understood theoretically.
We propose a new offline actor-critic algorithm that naturally incorporates the pessimism principle.
arXiv Detail & Related papers (2021-08-19T17:27:29Z)
- Understanding the Pathologies of Approximate Policy Evaluation when Combined with Greedification in Reinforcement Learning [11.295757620340899]
The theory of reinforcement learning with value function approximation remains fundamentally incomplete.
Prior work has identified a variety of pathological behaviours that arise in RL algorithms that combine approximate on-policy evaluation and greedification.
We present examples illustrating that, in addition to policy oscillation and multiple fixed points, the same basic issue can lead to convergence to the worst possible policy for a given approximation.
arXiv Detail & Related papers (2020-10-28T22:57:57Z)
- Approximation Benefits of Policy Gradient Methods with Aggregated States [8.348171150908724]
Folklore suggests that policy gradient can be more robust to misspecification than its relative, approximate policy iteration.
This paper shows that a policy gradient method converges to a policy whose per-period regret is bounded by $\epsilon$.
arXiv Detail & Related papers (2020-07-22T21:20:24Z)
- Efficient Evaluation of Natural Stochastic Policies in Offline Reinforcement Learning [80.42316902296832]
We study the efficient off-policy evaluation of natural policies, which are defined in terms of deviations from the behavior policy.
This is a departure from the literature on off-policy evaluation, where most work considers the evaluation of explicitly specified policies.
arXiv Detail & Related papers (2020-06-06T15:08:24Z)
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
- Kalman meets Bellman: Improving Policy Evaluation through Value Tracking [59.691919635037216]
Policy evaluation is a key process in Reinforcement Learning (RL).
We devise an optimization method called Kalman Optimization for Value Approximation (KOVA).
KOVA minimizes a regularized objective function that concerns both parameter and noisy return uncertainties.
arXiv Detail & Related papers (2020-02-17T13:30:43Z)