SQT -- std $Q$-target
- URL: http://arxiv.org/abs/2402.05950v3
- Date: Sun, 2 Jun 2024 19:39:44 GMT
- Title: SQT -- std $Q$-target
- Authors: Nitsan Soffair, Dotan Di-Castro, Orly Avner, Shie Mannor
- Abstract summary: Std $Q$-target is a conservative, actor-critic, ensemble, $Q$-learning-based algorithm.
We implement SQT on top of TD3/TD7 code and test it against the state-of-the-art (SOTA) actor-critic algorithms.
Our results demonstrate the superiority of SQT's $Q$-target formula over TD3's $Q$-target formula as a conservative solution to overestimation bias in RL.
- Score: 47.3621151424817
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Std $Q$-target is a conservative, actor-critic, ensemble, $Q$-learning-based algorithm built on a single key $Q$-formula: the $Q$-networks' standard deviation, which acts as an "uncertainty penalty" and serves as a minimalistic solution to the problem of overestimation bias. We implement SQT on top of the TD3/TD7 code and test it against the state-of-the-art (SOTA) actor-critic algorithms DDPG, TD3, and TD7 on seven popular MuJoCo and Bullet tasks. Our results demonstrate the superiority of SQT's $Q$-target formula over TD3's $Q$-target formula as a conservative solution to overestimation bias in RL, while SQT shows a clear performance advantage by a wide margin over DDPG, TD3, and TD7 on all tasks.
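The abstract does not spell out the target expression, so the following is a minimal sketch of what a std-penalized ensemble $Q$-target could look like, written against a generic twin-critic setup in PyTorch. The penalty coefficient `beta`, the use of the ensemble mean, and the function name `sqt_target` are illustrative assumptions, not the paper's exact formula.

```python
import torch

def sqt_target(rewards, next_qs, dones, gamma=0.99, beta=0.5):
    """Sketch of a std-penalized Q-target (assumed form, not the paper's exact formula).

    next_qs: tensor of shape (num_critics, batch) holding each ensemble member's
    estimate of Q(s', pi(s')) for the next state-action pair.
    """
    q_mean = next_qs.mean(dim=0)    # ensemble average
    q_std = next_qs.std(dim=0)      # ensemble disagreement, used as an "uncertainty penalty"
    next_q = q_mean - beta * q_std  # penalize state-actions the critics disagree on
    return rewards + gamma * (1.0 - dones) * next_q
```

For comparison, TD3's target replaces the mean-minus-std term with the element-wise minimum of its two critics, $\min(Q_1, Q_2)$, which is the baseline the abstract measures against.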
Related papers
- Transfer Q Star: Principled Decoding for LLM Alignment [105.89114186982972]
Transfer $Q^*$ estimates the optimal value function for a target reward $r$ through a baseline model.
Our approach significantly reduces the sub-optimality gap observed in prior SoTA methods.
arXiv Detail & Related papers (2024-05-30T21:36:12Z)
- Conservative DDPG -- Pessimistic RL without Ensemble [48.61228614796803]
DDPG is hindered by the overestimation bias problem.
Traditional solutions to this bias involve ensemble-based methods.
We propose a straightforward solution using a $Q$-target and incorporating a behavioral cloning (BC) loss penalty.
arXiv Detail & Related papers (2024-03-08T23:59:38Z)
- MinMaxMin $Q$-learning [48.61228614796803]
MinMaxMin $Q$-learning is a novel optimistic Actor-Critic algorithm that addresses the problem of overestimation bias.
We implement MinMaxMin on top of TD3 and TD7, subjecting it to rigorous testing against state-of-the-art continuous-space algorithms.
The results show a consistent performance improvement of MinMaxMin over DDPG, TD3, and TD7 across all tested tasks.
arXiv Detail & Related papers (2024-02-03T21:58:06Z)
- DCR: Divide-and-Conquer Reasoning for Multi-choice Question Answering with LLMs [9.561022942046279]
We propose Divide and Conquer Reasoning (DCR) to enhance the reasoning capability of large language models (LLMs).
We first categorize questions into two subsets based on a confidence score ($\mathcal{CS}$), which is estimated from the statistical frequency of generated answers.
arXiv Detail & Related papers (2024-01-10T14:38:46Z)
- Scalable implementation of $(d+1)$ mutually unbiased bases for $d$-dimensional quantum key distribution [0.0]
A high-dimensional quantum key distribution (QKD) can improve error rate tolerance and the secret key rate.
Many $d$-dimensional QKD protocols have used two mutually unbiased bases (MUBs).
We propose a scalable and general implementation of $(d+1)$ MUBs using $\log_p d$ interferometers in prime-power dimensions.
arXiv Detail & Related papers (2022-04-06T09:39:55Z)
- Polyak-Ruppert Averaged Q-Leaning is Statistically Efficient [90.14768299744792]
We study synchronous Q-learning with Polyak-Ruppert averaging (a.k.a. averaged Q-learning) in a $\gamma$-discounted MDP.
We establish the asymptotic normality of the averaged iterate $\bar{\boldsymbol{Q}}_T$.
In short, our theoretical analysis shows that averaged Q-learning is statistically efficient.
arXiv Detail & Related papers (2021-12-29T14:47:56Z)
- Minimal Expected Regret in Linear Quadratic Control [79.81807680370677]
We devise an online learning algorithm and provide guarantees on its expected regret.
This regret at time $T$ is upper bounded by $\widetilde{O}((d_u+d_x)\sqrt{d_x T})$ when $A$ and $B$ are unknown.
arXiv Detail & Related papers (2021-09-29T14:07:21Z)
- A Provably-Efficient Model-Free Algorithm for Constrained Markov Decision Processes [13.877420496703627]
This paper presents the first model-free, simulator-free reinforcement learning algorithm for Constrained Markov Decision Processes (CMDPs) with sublinear regret and zero constraint violation.
The algorithm is named Triple-Q because it has three key components: a $Q$-function for the cumulative reward, a $Q$-function for the cumulative utility of the constraint, and a virtual-Queue that over-estimates the cumulative constraint violation (a schematic sketch follows this entry).
arXiv Detail & Related papers (2021-06-03T03:53:27Z)
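The Triple-Q entry above only names the algorithm's three components, so here is a schematic, tabular sketch of how they could fit together. The greedy action rule that weights the utility $Q$-function by the virtual queue, the trade-off constant `eta`, and the queue update are assumptions inferred from the abstract, not the paper's stated update rules.

```python
import numpy as np

class TripleQSketch:
    """Schematic of the three components named in the abstract (interfaces are assumed)."""

    def __init__(self, num_states, num_actions, eta=1.0):
        self.q_reward = np.zeros((num_states, num_actions))   # Q-function for the cumulative reward
        self.q_utility = np.zeros((num_states, num_actions))  # Q-function for the cumulative constraint utility
        self.virtual_queue = 0.0                               # (over-)estimate of cumulative constraint violation
        self.eta = eta                                         # reward/constraint trade-off constant (illustrative)

    def act(self, state):
        # Trade reward off against the constraint, weighted by the current queue length.
        scores = self.q_reward[state] + (self.virtual_queue / self.eta) * self.q_utility[state]
        return int(np.argmax(scores))

    def update_queue(self, utility_estimate, budget):
        # Grow the queue whenever the estimated utility falls short of the constraint budget.
        self.virtual_queue = max(self.virtual_queue + budget - utility_estimate, 0.0)
```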
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.