Temporal-Difference Value Estimation via Uncertainty-Guided Soft Updates
- URL: http://arxiv.org/abs/2110.14818v1
- Date: Thu, 28 Oct 2021 00:07:19 GMT
- Title: Temporal-Difference Value Estimation via Uncertainty-Guided Soft Updates
- Authors: Litian Liang, Yaosheng Xu, Stephen McAleer, Dailin Hu, Alexander
Ihler, Pieter Abbeel, Roy Fox
- Abstract summary: Q-Learning has proven effective at learning a policy to perform control tasks.
Estimation noise becomes a bias after the max operator in the policy improvement step.
We present Unbiased Soft Q-Learning (UQL), which extends the work of EQL from two-action, finite-state spaces to multi-action, infinite-state Markov Decision Processes.
- Score: 110.92598350897192
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Temporal-Difference (TD) learning methods, such as Q-Learning, have proven
effective at learning a policy to perform control tasks. One issue with methods
like Q-Learning is that the value update introduces bias when predicting the TD
target of an unfamiliar state. Estimation noise becomes a bias after the max
operator in the policy improvement step, and carries over to value estimations
of other states, causing Q-Learning to overestimate the Q value. Algorithms
like Soft Q-Learning (SQL) introduce the notion of a soft-greedy policy, which
reduces the estimation bias via soft updates in early stages of training.
However, the inverse temperature $\beta$ that controls the softness of an
update is usually set by a hand-designed heuristic, which can be inaccurate at
capturing the uncertainty in the target estimate. Under the belief that $\beta$
is closely related to the (state dependent) model uncertainty, Entropy
Regularized Q-Learning (EQL) further introduces a principled scheduling of
$\beta$ by maintaining a collection of the model parameters that characterizes
model uncertainty. In this paper, we present Unbiased Soft Q-Learning (UQL),
which extends the work of EQL from two-action, finite-state spaces to
multi-action, infinite-state Markov Decision Processes. We also provide a
principled numerical scheduling of $\beta$, extended from SQL and using model
uncertainty, during the optimization process. We show the theoretical
guarantees and the effectiveness of this update method in experiments on
several discrete control environments.
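To make the update concrete, below is a minimal illustrative sketch (not the authors' implementation) of a soft TD target whose inverse temperature $\beta$ is scheduled from the disagreement of an ensemble of Q-estimates, used here as a stand-in for state-dependent model uncertainty. The function names, the uniform action prior, and the specific $\beta \propto 1/\sigma$ schedule are assumptions made for illustration only.

```python
import numpy as np

def soft_td_target(q_ensemble, reward, gamma, beta):
    """Soft TD target: a log-sum-exp (soft-max) backup instead of a hard max.

    q_ensemble : (n_models, n_actions) next-state Q estimates from an ensemble,
                 used here as a proxy for model uncertainty.
    beta       : inverse temperature; beta -> inf recovers the hard max,
                 beta -> 0 approaches a uniform average over actions.
    """
    q_mean = q_ensemble.mean(axis=0)          # ensemble-mean Q(s', .)
    # Numerically stable soft value with a uniform action prior:
    # V_soft = (1/beta) * log( (1/|A|) * sum_a exp(beta * Q(s', a)) )
    m = q_mean.max()
    soft_value = m + np.log(np.mean(np.exp(beta * (q_mean - m)))) / beta
    return reward + gamma * soft_value

def beta_from_uncertainty(q_ensemble, beta_max=50.0, eps=1e-8):
    """Heuristic schedule: high ensemble disagreement -> small beta (softer,
    less biased update); low disagreement -> large beta (near-greedy)."""
    sigma = q_ensemble.std(axis=0).mean()     # state-dependent uncertainty proxy
    return min(beta_max, 1.0 / (sigma + eps))

# Toy illustration of how estimation noise becomes overestimation under a hard max:
rng = np.random.default_rng(0)
true_q = np.zeros(4)                                   # every action is truly worth 0
noisy_q = true_q + rng.normal(0.0, 1.0, size=(5, 4))   # 5 noisy estimates, 4 actions
hard_target = noisy_q.mean(axis=0).max()               # positive on average: the bias
beta = beta_from_uncertainty(noisy_q)
soft_target = soft_td_target(noisy_q, reward=0.0, gamma=1.0, beta=beta)
print(f"hard-max target: {hard_target:.3f}, soft target: {soft_target:.3f}")
```

With a large $\beta$ the soft value approaches the hard max and inherits its overestimation; a small $\beta$, chosen when the ensemble disagrees, averages over actions and damps the bias, which is the intuition behind scheduling $\beta$ from model uncertainty.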
Related papers
- Time-Scale Separation in Q-Learning: Extending TD($\Delta$) for Action-Value Function Decomposition [0.0]
This paper introduces Q($\Delta$)-Learning, an extension of TD($\Delta$) to the Q-Learning framework.
TD($\Delta$) facilitates efficient learning over several time scales by decomposing the Q($\Delta$)-function across distinct discount factors.
We demonstrate through theoretical analysis and practical evaluations on standard benchmarks like Atari that Q($\Delta$)-Learning surpasses conventional Q-Learning and TD learning methods.
arXiv Detail & Related papers (2024-11-21T11:03:07Z)
- Sublinear Regret for a Class of Continuous-Time Linear--Quadratic Reinforcement Learning Problems [10.404992912881601]
We study reinforcement learning for a class of continuous-time linear-quadratic (LQ) control problems for diffusions.
We apply a model-free approach that relies neither on knowledge of model parameters nor on their estimations, and devise an actor-critic algorithm to learn the optimal policy parameter directly.
arXiv Detail & Related papers (2024-07-24T12:26:21Z)
- Regularized Q-learning through Robust Averaging [3.4354636842203026]
We propose a new Q-learning variant, called 2RA Q-learning, that addresses some weaknesses of existing Q-learning methods in a principled manner.
One such weakness is an underlying estimation bias which cannot be controlled and often results in poor performance.
We show that 2RA Q-learning converges to the optimal policy and analyze its theoretical mean-squared error.
arXiv Detail & Related papers (2024-05-03T15:57:26Z)
- A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning [54.48409201256968]
Offline-to-online Reinforcement Learning (O2O RL) aims to improve the performance of an offline pretrained policy using only a few online samples.
Most O2O methods focus on the balance between the RL objective and pessimism, or on the utilization of offline and online samples.
arXiv Detail & Related papers (2023-12-12T19:24:35Z)
- Value-Distributional Model-Based Reinforcement Learning [59.758009422067]
Quantifying uncertainty about a policy's long-term performance is important to solve sequential decision-making tasks.
We study the problem from a model-based Bayesian reinforcement learning perspective.
We propose Epistemic Quantile-Regression (EQR), a model-based algorithm that learns a value distribution function (a generic quantile-regression loss is sketched after this entry).
arXiv Detail & Related papers (2023-08-12T14:59:19Z)
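The summary above names quantile regression but gives no details; as a reference point only, here is the standard quantile-regression (pinball) loss that value-distribution methods of this kind typically build on. Whether EQR uses exactly this form, and the names used below, are assumptions.

```python
import numpy as np

def pinball_loss(predicted_quantiles, target, taus):
    """Quantile-regression (pinball) loss for learning a distribution of returns.

    predicted_quantiles : (n,) predicted quantile values of the return
    target              : scalar return sample (or bootstrapped TD target)
    taus                : (n,) quantile levels in (0, 1), e.g. midpoints of 1/n bins
    """
    diff = target - predicted_quantiles
    # Asymmetric absolute loss: over-predictions are weighted by (1 - tau),
    # under-predictions by tau, so each output converges to its quantile.
    return float(np.mean(np.where(diff >= 0, taus * diff, (taus - 1.0) * diff)))

# Example: 5 quantile estimates of the return at levels 0.1, 0.3, ..., 0.9
taus = (np.arange(5) + 0.5) / 5
loss = pinball_loss(np.array([0.0, 0.5, 1.0, 1.5, 2.0]), target=1.2, taus=taus)
```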
- Offline RL with No OOD Actions: In-Sample Learning via Implicit Value Regularization [90.9780151608281]
In-sample learning (IQL) improves the policy by quantile regression using only data samples.
We make a key finding that the in-sample learning paradigm arises under the Implicit Value Regularization (IVR) framework.
We propose two practical algorithms, Sparse $Q$-learning (SQL) and Exponential $Q$-learning (EQL), which adopt the same value regularization used in existing works.
arXiv Detail & Related papers (2023-03-28T08:30:01Z)
- Control-Tutored Reinforcement Learning: Towards the Integration of Data-Driven and Model-Based Control [0.0]
We present an architecture where a feedback controller derived on an approximate model of the environment assists the learning process to enhance its data efficiency.
This architecture, which we term Control-Tutored Q-learning (CTQL), is presented in two alternative flavours.
The former is based on defining the reward function so that a Boolean condition can be used to determine when the control tutor policy is adopted.
The latter, termed probabilistic CTQL (pCTQL), instead executes calls to the tutor with a certain probability during learning (a minimal sketch of this mixing follows this entry).
arXiv Detail & Related papers (2021-12-11T16:34:36Z)
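Read literally, the probabilistic variant above amounts to mixing a tutor policy with the greedy Q policy at each step. The sketch below is one such minimal reading; the function name, the fixed `p_tutor`, and treating the tutor as a given action suggestion are assumptions made for illustration.

```python
import numpy as np

def pctql_action(q_values, tutor_action, p_tutor, rng):
    """Action selection in the spirit of pCTQL: with probability p_tutor follow
    the control tutor (a feedback controller on an approximate model), otherwise
    act greedily on the learned Q-values. Both branches still generate
    transitions for the ordinary Q-learning update.
    """
    if rng.random() < p_tutor:
        return tutor_action                # action suggested by the feedback controller
    return int(np.argmax(q_values))        # standard greedy action on the current Q

# Illustrative usage with placeholder values:
rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.4, -0.2])      # hypothetical Q(s, .) for 3 actions
tutor_action = 2                           # hypothetical suggestion from the tutor controller
a = pctql_action(q_values, tutor_action, p_tutor=0.3, rng=rng)
```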
- Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z)
- Estimation Error Correction in Deep Reinforcement Learning for Deterministic Actor-Critic Methods [0.0]
In value-based deep reinforcement learning methods, approximation of value functions induces overestimation bias and leads to suboptimal policies.
We show that in deep actor-critic methods that aim to overcome the overestimation bias, if the reinforcement signals received by the agent have a high variance, a significant underestimation bias arises.
To minimize the underestimation, we introduce a parameter-free, novel deep Q-learning variant.
arXiv Detail & Related papers (2021-09-22T13:49:35Z)
- Task-Specific Normalization for Continual Learning of Blind Image Quality Models [105.03239956378465]
We present a simple yet effective continual learning method for blind image quality assessment (BIQA).
The key step in our approach is to freeze all convolution filters of a pre-trained deep neural network (DNN) for an explicit promise of stability.
We assign each new IQA dataset (i.e., task) a prediction head, and load the corresponding normalization parameters to produce a quality score.
The final quality estimate is computed by a weighted summation of predictions from all heads with a lightweight $K$-means gating mechanism (a minimal sketch of this gating follows this entry).
arXiv Detail & Related papers (2021-07-28T15:21:01Z)
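The gating itself is not detailed in the summary above; the sketch below only implements the stated idea of a weighted summation over task-specific heads, assuming per-task $K$-means centroids are precomputed on training features and that weights come from a softmax over negative centroid distances (both are assumptions).

```python
import numpy as np

def gated_quality_score(feature, head_scores, task_centroids, tau=1.0):
    """Weighted summation of per-task head predictions, gated by the distance
    of the test feature to each task's (precomputed) K-means centroids.

    feature        : (d,) feature vector of the test image
    head_scores    : (T,) quality prediction from each task-specific head
    task_centroids : list of T arrays, each (k_t, d)
    """
    # Distance of the feature to the nearest centroid of each task.
    dists = np.array([np.linalg.norm(c - feature, axis=1).min() for c in task_centroids])
    # Closer tasks get larger weights (softmax over negative distances -- an assumption).
    weights = np.exp(-dists / tau)
    weights /= weights.sum()
    return float(np.dot(weights, head_scores))
```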
This list is automatically generated from the titles and abstracts of the papers on this site.