CASA-B: A Unified Framework of Model-Free Reinforcement Learning
- URL: http://arxiv.org/abs/2105.03923v1
- Date: Sun, 9 May 2021 12:45:13 GMT
- Title: CASA-B: A Unified Framework of Model-Free Reinforcement Learning
- Authors: Changnan Xiao, Haosen Shi, Jiajun Fan, Shihong Deng
- Abstract summary: CASA-B is an actor-critic framework that estimates the state value, the state-action value, and the policy.
We prove that CASA-B integrates a consistent path for policy evaluation and policy improvement.
We propose a progressive closed-form entropy control mechanism that explicitly controls the entropy of the behavior policies within an arbitrary range.
- Score: 1.4566990078034239
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Building on recent breakthroughs in reinforcement learning, this paper introduces
CASA-B (Critic AS an Actor with Bandits Vote algorithm), a unified framework for
model-free reinforcement learning. CASA-B is an actor-critic framework that
estimates the state value, the state-action value, and the policy. An expectation-correct
Doubly Robust Trace is introduced to learn the state value and the state-action value,
with guaranteed convergence properties. We prove that CASA-B integrates a
consistent path for policy evaluation and policy improvement. The
policy evaluation is equivalent to a compensational policy improvement, which
alleviates the function approximation error, and is also equivalent to an
entropy-regularized policy improvement, which prevents the policy from
collapsing to a suboptimal solution. Building on this design, we find that the
entropies of the behavior policies and the target policy are disentangled.
Based on this observation, we propose a progressive closed-form entropy control
mechanism that explicitly controls the entropy of the behavior policies within
an arbitrary range. Our experiments show that CASA-B is highly sample-efficient and
achieves state-of-the-art performance on the Arcade Learning Environment: under a
200M training scale, our mean Human Normalized Score is 6456.63% and our median
Human Normalized Score is 477.17%.
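For intuition, the sketch below shows one way "explicitly controlling a behavior policy's entropy" can be realized for a discrete softmax policy: the temperature is tuned by bisection until the policy's entropy hits a requested target. This is an illustrative assumption, not CASA-B's closed-form mechanism; all function names are made up for the example.
```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (nats)."""
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p))

def softmax(q, temperature):
    z = q / temperature
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

def behavior_policy_with_target_entropy(q_values, target_entropy, tol=1e-6):
    """Return a softmax policy over q_values whose entropy is ~target_entropy.

    Entropy grows monotonically with temperature, so bisection converges
    quickly.  Only an illustration of explicit entropy control, not the
    paper's progressive closed-form mechanism.
    """
    lo, hi = 1e-4, 1e4
    for _ in range(100):
        mid = np.sqrt(lo * hi)          # geometric midpoint, temperature > 0
        pi = softmax(q_values, mid)
        if abs(entropy(pi) - target_entropy) < tol:
            break
        if entropy(pi) < target_entropy:
            lo = mid                    # too greedy -> raise temperature
        else:
            hi = mid
    return pi

q = np.array([1.0, 0.5, 0.2, -0.3])
pi_b = behavior_policy_with_target_entropy(q, target_entropy=0.8)
print(pi_b, entropy(pi_b))
```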
Related papers
- Actor-Critic Reinforcement Learning with Phased Actor [10.577516871906816]
We propose a novel phased actor in actor-critic (PAAC) method to improve policy gradient estimation.
PAAC accounts for both $Q$ value and TD error in its actor update.
Results show that PAAC leads to significant performance improvement measured by total cost, learning variance, robustness, learning speed and success rate.
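As a rough illustration only (not the paper's exact update), the sketch below blends a Q-value signal and a TD-error signal in a softmax policy-gradient step; the phase_weight schedule and interface are assumptions.
```python
import numpy as np

def paac_style_actor_grad(logits, action, q_sa, td_error, phase_weight):
    """Policy-gradient signal blending Q and TD error (illustrative only).

    phase_weight in [0, 1]: 1 -> pure Q-value signal, 0 -> pure TD-error
    signal.  PAAC's actual phase schedule is not reproduced here.
    """
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    grad_logp = -probs
    grad_logp[action] += 1.0            # d log pi(a|s) / d logits
    signal = phase_weight * q_sa + (1.0 - phase_weight) * td_error
    return signal * grad_logp           # ascent direction for the logits

# example: 4 discrete actions, blend the two signals equally
g = paac_style_actor_grad(np.zeros(4), action=2, q_sa=1.3, td_error=0.4, phase_weight=0.5)
```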
arXiv Detail & Related papers (2024-04-18T01:27:31Z) - Matrix Estimation for Offline Reinforcement Learning with Low-Rank Structure [10.968373699696455]
We consider offline Reinforcement Learning (RL), where the agent does not interact with the environment and must rely on offline data collected using a behavior policy.
Previous works provide policy evaluation guarantees when the target policy to be evaluated is covered by the behavior policy.
We propose an offline policy evaluation algorithm that leverages the low-rank structure to estimate the values of uncovered state-action pairs.
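To make the low-rank idea concrete, here is a generic hard-impute / truncated-SVD completion of a partially observed Q matrix; it is a stand-in illustration, not the paper's estimator.
```python
import numpy as np

def complete_q_matrix(q_obs, mask, rank=2, iters=200):
    """Fill unobserved state-action values assuming Q is (approximately) low rank.

    q_obs: (S, A) array of observed entries; mask: True where observed.
    Simple hard-impute iteration -- illustrative only.
    """
    q = np.where(mask, q_obs, 0.0)
    for _ in range(iters):
        u, s, vt = np.linalg.svd(q, full_matrices=False)
        q_low = (u[:, :rank] * s[:rank]) @ vt[:rank]      # best rank-r approximation
        q = np.where(mask, q_obs, q_low)                  # keep observed entries fixed
    return q

# toy example: true Q is rank-1, ~70% of the entries observed
rng = np.random.default_rng(0)
true_q = np.outer(rng.normal(size=6), rng.normal(size=3))
mask = rng.random(true_q.shape) < 0.7
estimate = complete_q_matrix(true_q, mask, rank=1)
print(np.max(np.abs(estimate - true_q)))
```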
arXiv Detail & Related papers (2023-05-24T23:49:06Z) - Hallucinated Adversarial Control for Conservative Offline Policy Evaluation [64.94009515033984]
We study the problem of conservative off-policy evaluation (COPE) where given an offline dataset of environment interactions, we seek to obtain a (tight) lower bound on a policy's performance.
We introduce HAMBO, which builds on an uncertainty-aware learned model of the transition dynamics.
We prove that the resulting COPE estimates are valid lower bounds, and, under regularity conditions, show their convergence to the true expected return.
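HAMBO lets an adversary steer hallucinated dynamics inside the model's confidence region; the sketch below substitutes a much simpler pessimism heuristic (penalizing model-predicted rewards by their uncertainty) just to convey the "lower bound via a learned model" idea. The model_step and policy interfaces are assumptions.
```python
import numpy as np

def pessimistic_return_estimate(model_step, policy, s0, horizon, gamma, beta, n_rollouts=100):
    """Crude lower-bound-style estimate from model rollouts (not HAMBO).

    model_step(s, a) -> (next_state, reward_mean, reward_std)
    policy(s)        -> action
    Each predicted reward is penalized by beta * (model uncertainty).
    """
    totals = []
    for _ in range(n_rollouts):
        s, ret, disc = s0, 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            s, r_mean, r_std = model_step(s, a)
            ret += disc * (r_mean - beta * r_std)   # pessimistic reward
            disc *= gamma
        totals.append(ret)
    return float(np.mean(totals))
```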
arXiv Detail & Related papers (2023-03-02T08:57:35Z) - Conservative State Value Estimation for Offline Reinforcement Learning [36.416504941791224]
Conservative State Value Estimation (CSVE) learns a conservative V-function by directly imposing a penalty on out-of-distribution (OOD) states.
We develop a practical actor-critic algorithm in which the critic performs conservative value estimation by additionally sampling and penalizing states around the dataset.
We evaluate on the classic continuous control tasks of D4RL, showing that our method outperforms conservative Q-function learning methods and is strongly competitive with recent SOTA methods.
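The core idea, sketched for a linear value function (an assumed setup, not the paper's implementation): regress toward TD targets on dataset states while pushing down the values of sampled states around or outside the data.
```python
import numpy as np

def conservative_v_loss(v, data_states, data_targets, ood_states, beta):
    """Sketch of a CSVE-style critic objective with a linear value function.

    v: weight vector; states are feature vectors (rows).
    Loss = TD regression term
           + beta * (mean V on OOD states - mean V on dataset states),
    which lowers values of states outside the data relative to states in it.
    How OOD states are sampled is simplified away here.
    """
    td_term = np.mean((data_states @ v - data_targets) ** 2)
    penalty = np.mean(ood_states @ v) - np.mean(data_states @ v)
    return td_term + beta * penalty
```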
arXiv Detail & Related papers (2023-02-14T08:13:55Z) - Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
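IQL avoids querying out-of-dataset actions by fitting the value function with expectile regression on Q-values of dataset actions (Kostrikov et al., 2021); a minimal standalone sketch of that loss follows.
```python
import numpy as np

def expectile_loss(q_sa, v_s, tau=0.7):
    """Expectile regression loss used to fit V from in-dataset Q values.

    With tau > 0.5 the fitted V(s) approaches an upper expectile of Q(s, a)
    over actions that actually appear in the dataset, so no out-of-dataset
    action is ever evaluated.  Shown here as an isolated illustration.
    """
    u = np.asarray(q_sa) - np.asarray(v_s)
    weight = np.where(u > 0, tau, 1.0 - tau)
    return float(np.mean(weight * u ** 2))
```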
arXiv Detail & Related papers (2021-10-12T17:05:05Z) - Ensuring Monotonic Policy Improvement in Entropy-regularized Value-based Reinforcement Learning [14.325835899564664]
An entropy-regularized value-based reinforcement learning method can ensure the monotonic improvement of policies at each policy update.
We propose a novel reinforcement learning algorithm that exploits this lower-bound as a criterion for adjusting the degree of a policy update for alleviating policy oscillation.
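A generic illustration of "use the improvement lower bound as a criterion": pick the largest update magnitude whose estimated lower bound is nonnegative. The paper's entropy-regularization-based bound is not reproduced; the interface is an assumption.
```python
def safe_update_coefficient(lower_bound_fn, alphas=(1.0, 0.5, 0.25, 0.1, 0.05)):
    """Pick the largest mixing coefficient whose estimated improvement
    lower bound is nonnegative; otherwise make no update.

    lower_bound_fn(alpha) should return an estimated lower bound on the
    performance change when moving a fraction `alpha` toward the new
    policy.  Illustrative only.
    """
    for alpha in alphas:
        if lower_bound_fn(alpha) >= 0.0:
            return alpha
    return 0.0
```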
arXiv Detail & Related papers (2020-08-25T04:09:18Z) - Implicit Distributional Reinforcement Learning [61.166030238490634]
We propose the implicit distributional actor-critic (IDAC), built on two deep generator networks (DGNs) and a semi-implicit actor (SIA) powered by a flexible policy distribution.
We observe IDAC outperforms state-of-the-art algorithms on representative OpenAI Gym environments.
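To make "sample-based distributional critic" concrete, here is a generic quantile-Huber loss between generator return samples and target return samples (a QR-DQN-style objective used only as an illustration, not IDAC's exact training loss).
```python
import numpy as np

def sample_based_quantile_loss(generated_returns, target_returns, kappa=1.0):
    """Quantile Huber loss between two sets of return samples.

    The generator's sorted samples are treated as implicit quantiles and
    matched against bootstrapped target samples.  Generic sketch only.
    """
    g = np.sort(np.asarray(generated_returns, dtype=float))
    n = len(g)
    taus = (np.arange(n) + 0.5) / n                       # implicit quantile levels
    t = np.asarray(target_returns, dtype=float)
    u = t[None, :] - g[:, None]                           # pairwise TD-style errors
    huber = np.where(np.abs(u) <= kappa,
                     0.5 * u ** 2,
                     kappa * (np.abs(u) - 0.5 * kappa))
    weight = np.abs(taus[:, None] - (u < 0).astype(float))
    return float(np.mean(weight * huber))
```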
arXiv Detail & Related papers (2020-07-13T02:52:18Z) - Efficient Evaluation of Natural Stochastic Policies in Offline Reinforcement Learning [80.42316902296832]
We study the efficient off-policy evaluation of natural policies, which are defined in terms of deviations from the behavior policy.
This is a departure from the literature on off-policy evaluation, where most work considers the evaluation of explicitly specified policies.
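For context, a plain per-decision importance-sampling estimator for one such "natural" target policy (here, a tempered version of the logged behavior distribution) is sketched below; the paper's efficient estimators are not reproduced, and the trajectory format is an assumption.
```python
import numpy as np

def is_estimate_for_tilted_policy(trajectories, gamma=0.99, temperature=2.0):
    """Per-decision importance sampling for a target policy defined as a
    deviation from the behavior policy (a softened/tempered version of the
    logged action distribution).  Illustrative setup only.

    Each trajectory is a list of (behavior_probs, action, reward) tuples,
    where behavior_probs is the logged action distribution at that step.
    """
    returns = []
    for traj in trajectories:
        weight, ret, disc = 1.0, 0.0, 1.0
        for behavior_probs, action, reward in traj:
            p = np.asarray(behavior_probs, dtype=float)
            tempered = p ** (1.0 / temperature)
            target_probs = tempered / tempered.sum()    # deviation from behavior
            weight *= target_probs[action] / p[action]
            ret += disc * weight * reward               # per-decision IS
            disc *= gamma
        returns.append(ret)
    return float(np.mean(returns))
```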
arXiv Detail & Related papers (2020-06-06T15:08:24Z) - Policy Entropy for Out-of-Distribution Classification [8.747840760772268]
We propose PEOC, a new policy entropy based out-of-distribution classifier.
It reliably detects unencountered states in deep reinforcement learning.
It is highly competitive against state-of-the-art one-class classification algorithms.
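A minimal sketch of the underlying idea: the entropy of the agent's policy at a state serves as the out-of-distribution score. The thresholding rule shown here is an assumption, not PEOC's trained one-class classifier.
```python
import numpy as np

def policy_entropy_ood_score(action_probs):
    """OOD score in the spirit of PEOC: policy entropy at a state.

    A policy trained on in-distribution states tends to be confident
    (low entropy) there, so unusually high entropy flags unfamiliar states.
    """
    p = np.clip(np.asarray(action_probs, dtype=float), 1e-12, 1.0)
    return float(-np.sum(p * np.log(p)))

def is_out_of_distribution(action_probs, threshold):
    # threshold could be, e.g., a high quantile of entropies on training states
    return policy_entropy_ood_score(action_probs) > threshold
```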
arXiv Detail & Related papers (2020-05-25T12:18:20Z) - Kalman meets Bellman: Improving Policy Evaluation through Value Tracking [59.691919635037216]
Policy evaluation is a key process in Reinforcement Learning (RL).
We devise an optimization method called Kalman Optimization for Value Approximation (KOVA).
KOVA minimizes a regularized objective function that concerns both parameter and noisy return uncertainties.
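To convey the idea, the sketch below applies a textbook Kalman-filter update to a linear value function, treating bootstrapped returns as noisy observations of the value; KOVA's actual regularized objective for deep networks is not reproduced.
```python
import numpy as np

def kalman_value_update(theta, P, phi_s, target, obs_var=1.0, process_var=1e-3):
    """One Kalman-style update of a linear value function V(s) = theta . phi(s).

    P tracks parameter uncertainty; obs_var models return noise.  Standard
    Kalman filter specialized to value tracking, shown only to convey the
    idea behind KOVA.
    """
    phi = np.asarray(phi_s, dtype=float)
    P = P + process_var * np.eye(len(theta))          # predict: parameters may drift
    innovation = target - theta @ phi                 # TD-style observation residual
    s = phi @ P @ phi + obs_var                       # innovation variance (scalar)
    K = P @ phi / s                                   # Kalman gain
    theta = theta + K * innovation
    P = P - np.outer(K, phi) @ P
    return theta, P
```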
arXiv Detail & Related papers (2020-02-17T13:30:43Z) - Confounding-Robust Policy Evaluation in Infinite-Horizon Reinforcement Learning [70.01650994156797]
Off-policy evaluation of sequential decision policies from observational data is necessary in batch reinforcement learning settings such as education and healthcare.
We develop an approach that estimates the bounds of a given policy.
We prove convergence to the sharp bounds as we collect more confounded data.
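As a toy illustration only (the paper's bounds are sharper and handle the infinite-horizon sequential setting), the sketch below lower-bounds a simple importance-sampling estimate when each importance weight is known only up to a multiplicative sensitivity factor.
```python
import numpy as np

def confounded_is_lower_bound(nominal_weights, rewards, gamma_sensitivity=2.0):
    """Crude lower bound on an importance-sampling value estimate under
    unobserved confounding: the true weight of each sample may lie anywhere
    in [w / Gamma, w * Gamma], and an adversary picks the value that
    minimizes that sample's contribution.  Illustrative toy only.
    """
    w = np.asarray(nominal_weights, dtype=float)
    r = np.asarray(rewards, dtype=float)
    worst_w = np.where(r >= 0, w / gamma_sensitivity, w * gamma_sensitivity)
    return float(np.mean(worst_w * r))
```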
arXiv Detail & Related papers (2020-02-11T16:18:14Z)