SQT -- std $Q$-target
- URL: http://arxiv.org/abs/2402.05950v3
- Date: Sun, 2 Jun 2024 19:39:44 GMT
- Title: SQT -- std $Q$-target
- Authors: Nitsan Soffair, Dotan Di-Castro, Orly Avner, Shie Mannor
- Abstract summary: Std $Q$-target is a conservative, actor-critic, ensemble, $Q$-learning-based algorithm.
We implement SQT on top of TD3/TD7 code and test it against the state-of-the-art (SOTA) actor-critic algorithms.
Our results demonstrate the superiority of SQT's $Q$-target formula over TD3's $Q$-target formula as a conservative solution to overestimation bias in RL.
- Score: 47.3621151424817
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Std $Q$-target is a conservative, actor-critic, ensemble, $Q$-learning-based algorithm built on a single key $Q$-formula: the $Q$-networks' standard deviation, which acts as an "uncertainty penalty" and serves as a minimalistic solution to the problem of overestimation bias. We implement SQT on top of the TD3/TD7 code and test it against the state-of-the-art (SOTA) actor-critic algorithms DDPG, TD3, and TD7 on seven popular MuJoCo and Bullet tasks. Our results demonstrate the superiority of SQT's $Q$-target formula over TD3's $Q$-target formula as a conservative solution to overestimation bias in RL, while SQT shows a clear performance advantage by a wide margin over DDPG, TD3, and TD7 on all tasks.
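The abstract does not spell out the target expression, so the following is a minimal sketch of what a std-penalized ensemble $Q$-target could look like, written against a generic twin-critic setup in PyTorch. The penalty coefficient `beta`, the use of the ensemble mean, and the function name `sqt_target` are illustrative assumptions, not the paper's exact formula.

```python
import torch

def sqt_target(rewards, next_qs, dones, gamma=0.99, beta=0.5):
    """Sketch of a std-penalized Q-target (assumed form, not the paper's exact formula).

    next_qs: tensor of shape (num_critics, batch) holding each ensemble member's
    estimate of Q(s', pi(s')) for the next state-action pair.
    """
    q_mean = next_qs.mean(dim=0)    # ensemble average
    q_std = next_qs.std(dim=0)      # ensemble disagreement, used as an "uncertainty penalty"
    next_q = q_mean - beta * q_std  # penalize state-actions the critics disagree on
    return rewards + gamma * (1.0 - dones) * next_q
```

For comparison, TD3's target replaces the mean-minus-std term with the element-wise minimum of its two critics, $\min(Q_1, Q_2)$, which is the baseline the abstract measures against.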
Related papers
- Transfer Q Star: Principled Decoding for LLM Alignment [105.89114186982972]
Transfer $Q^*$ estimates the optimal value function for a target reward $r$ through a baseline model.
Our approach significantly reduces the sub-optimality gap observed in prior SoTA methods.
arXiv Detail & Related papers (2024-05-30T21:36:12Z)
- Conservative DDPG -- Pessimistic RL without Ensemble [48.61228614796803]
DDPG is hindered by the overestimation bias problem.
Traditional solutions to this bias involve ensemble-based methods.
We propose a straightforward solution using a $Q$-target and incorporating a behavioral cloning (BC) loss penalty.
arXiv Detail & Related papers (2024-03-08T23:59:38Z)
- MinMaxMin $Q$-learning [48.61228614796803]
MinMaxMin $Q$-learning is a novel optimistic Actor-Critic algorithm that addresses the problem of overestimation bias.
We implement MinMaxMin on top of TD3 and TD7, subjecting it to rigorous testing against state-of-the-art continuous-space algorithms.
The results show a consistent performance improvement of MinMaxMin over DDPG, TD3, and TD7 across all tested tasks.
arXiv Detail & Related papers (2024-02-03T21:58:06Z)
- DCR: Divide-and-Conquer Reasoning for Multi-choice Question Answering with LLMs [9.561022942046279]
We propose Divide and Conquer Reasoning (DCR) to enhance the reasoning capability of large language models (LLMs).
We first categorize questions into two subsets based on a confidence score ($\mathcal{CS}$), which is estimated from the statistical frequency of generated answers.
arXiv Detail & Related papers (2024-01-10T14:38:46Z)
- Scalable implementation of $(d+1)$ mutually unbiased bases for $d$-dimensional quantum key distribution [0.0]
A high-dimensional quantum key distribution (QKD) can improve error rate tolerance and the secret key rate.
Many $d$-dimensional QKD protocols have used two mutually unbiased bases (MUBs).
We propose a scalable and general implementation of $(d+1)$ MUBs using $\log_p d$ interferometers in prime-power dimensions.
arXiv Detail & Related papers (2022-04-06T09:39:55Z)
- Polyak-Ruppert Averaged Q-Leaning is Statistically Efficient [90.14768299744792]
We study synchronous Q-learning with Polyak-Ruppert averaging (a.k.a. averaged Q-learning) in a $\gamma$-discounted MDP.
We establish the asymptotic normality of the averaged iterate $\bar{\boldsymbol{Q}}_T$.
In short, our theoretical analysis shows that averaged Q-learning is statistically efficient.
arXiv Detail & Related papers (2021-12-29T14:47:56Z)
- Minimal Expected Regret in Linear Quadratic Control [79.81807680370677]
We devise an online learning algorithm and provide guarantees on its expected regret.
This regret at time $T$ is upper bounded by $\widetilde{O}((d_u+d_x)\sqrt{d_x T})$ when $A$ and $B$ are unknown.
arXiv Detail & Related papers (2021-09-29T14:07:21Z)
- A Provably-Efficient Model-Free Algorithm for Constrained Markov Decision Processes [13.877420496703627]
This paper presents the first model-free, simulator-free reinforcement learning algorithm for Constrained Markov Decision Processes (CMDPs) with sublinear regret and zero constraint violation.
The algorithm is named Triple-Q because it has three key components: a $Q$-function for the cumulative reward, a $Q$-function for the cumulative utility of the constraint, and a virtual-Queue that over-estimates the cumulative constraint violation (a schematic sketch follows this entry).
arXiv Detail & Related papers (2021-06-03T03:53:27Z)
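The Triple-Q entry above only names the algorithm's three components, so here is a schematic, tabular sketch of how they could fit together. The greedy action rule that weights the utility $Q$-function by the virtual queue, the trade-off constant `eta`, and the queue update are assumptions inferred from the abstract, not the paper's stated update rules.

```python
import numpy as np

class TripleQSketch:
    """Schematic of the three components named in the abstract (interfaces are assumed)."""

    def __init__(self, num_states, num_actions, eta=1.0):
        self.q_reward = np.zeros((num_states, num_actions))   # Q-function for the cumulative reward
        self.q_utility = np.zeros((num_states, num_actions))  # Q-function for the cumulative constraint utility
        self.virtual_queue = 0.0                               # (over-)estimate of cumulative constraint violation
        self.eta = eta                                         # reward/constraint trade-off constant (illustrative)

    def act(self, state):
        # Trade reward off against the constraint, weighted by the current queue length.
        scores = self.q_reward[state] + (self.virtual_queue / self.eta) * self.q_utility[state]
        return int(np.argmax(scores))

    def update_queue(self, utility_estimate, budget):
        # Grow the queue whenever the estimated utility falls short of the constraint budget.
        self.virtual_queue = max(self.virtual_queue + budget - utility_estimate, 0.0)
```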
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.