Controlling Overestimation Bias with Truncated Mixture of Continuous
Distributional Quantile Critics
- URL: http://arxiv.org/abs/2005.04269v1
- Date: Fri, 8 May 2020 19:52:26 GMT
- Title: Controlling Overestimation Bias with Truncated Mixture of Continuous
Distributional Quantile Critics
- Authors: Arsenii Kuznetsov, Pavel Shvechikov, Alexander Grishin, Dmitry Vetrov
- Abstract summary: Overestimation bias is one of the major impediments to accurate off-policy learning.
This paper investigates a novel way to alleviate the overestimation bias in a continuous control setting.
Our method, Truncated Quantile Critics (TQC), blends three ideas: a distributional representation of the critic, truncation of the critics' predictions, and ensembling of multiple critics.
- Score: 65.51757376525798
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Overestimation bias is one of the major impediments to accurate
off-policy learning. This paper investigates a novel way to alleviate
overestimation bias in a continuous control setting. Our method, Truncated
Quantile Critics (TQC), blends three ideas: a distributional representation of
the critic, truncation of the critics' predictions, and ensembling of multiple
critics. The distributional representation and truncation allow arbitrarily
fine-grained control over the overestimation bias, while ensembling provides
additional score improvements. TQC outperforms the current state of the art on
all environments from the continuous control benchmark suite, demonstrating a
25% improvement on the most challenging Humanoid environment.
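The core mechanism is the truncation step: each critic predicts a set of quantile atoms of the return distribution, the atoms from all critics are pooled and sorted, and the largest few atoms per critic are dropped before the Bellman backup, which suppresses the overestimating right tail of the pooled distribution. Below is a minimal NumPy sketch of that step under simplifying assumptions (scalar reward and discount, no entropy term); the function and variable names are illustrative and not taken from the authors' implementation.

```python
import numpy as np

def truncated_target_atoms(critic_atoms, drop_per_critic, reward, discount):
    """Illustrative sketch of TQC-style truncation (not the authors' code).

    critic_atoms: array of shape (n_critics, n_quantiles) holding each target
        critic's predicted quantile atoms for the next state-action pair.
    drop_per_critic: number of largest atoms to discard per critic, which
        controls how strongly overestimation is suppressed.
    """
    n_critics, n_quantiles = critic_atoms.shape
    # Pool the atoms from all critics into one empirical distribution and sort.
    pooled = np.sort(critic_atoms.reshape(-1))
    # Keep only the smallest (n_quantiles - drop_per_critic) * n_critics atoms,
    # i.e. truncate the right tail where overestimation concentrates.
    keep = (n_quantiles - drop_per_critic) * n_critics
    truncated = pooled[:keep]
    # Back up the surviving atoms through the Bellman operator to get targets.
    return reward + discount * truncated

# Tiny usage example: 2 critics with 5 atoms each, dropping 1 atom per critic.
atoms = np.array([[0.1, 0.4, 0.9, 1.5, 3.0],
                  [0.2, 0.5, 0.8, 1.4, 2.5]])
targets = truncated_target_atoms(atoms, drop_per_critic=1, reward=1.0, discount=0.99)
print(targets)  # 8 target atoms used to fit each critic's quantiles
```

Dropping more atoms per critic makes the target more pessimistic, so the drop count acts as the fine-grained knob on overestimation that the abstract refers to.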
Related papers
- Contrastive CFG: Improving CFG in Diffusion Models by Contrasting Positive and Negative Concepts [55.298031232672734]
Classifier-Free Guidance (CFG) has proven effective in conditional diffusion model sampling for improved condition alignment.
We present a novel method to enhance negative CFG guidance using contrastive loss.
arXiv Detail & Related papers (2024-11-26T03:29:27Z)
- On Centralized Critics in Multi-Agent Reinforcement Learning [16.361249170514828]
Centralized Training for Decentralized Execution has become a popular approach in Multi-Agent Reinforcement Learning.
We analyze the effect of using state-based critics in partially observable environments.
arXiv Detail & Related papers (2024-08-26T19:27:06Z)
- A Deeper Understanding of State-Based Critics in Multi-Agent Reinforcement Learning [17.36759906285316]
We show that state-based critics can introduce bias in the policy estimates, potentially undermining the guarantees of the algorithm.
We also show that, even if the state-based critics do not introduce any bias, they can still result in a larger gradient variance, contrary to the common intuition.
arXiv Detail & Related papers (2022-01-03T14:51:30Z) - Assessment of Treatment Effect Estimators for Heavy-Tailed Data [70.72363097550483]
A central obstacle in the objective assessment of treatment effect (TE) estimators in randomized control trials (RCTs) is the lack of ground truth (or validation set) to test their performance.
We provide a novel cross-validation-like methodology to address this challenge.
We evaluate our methodology across 709 RCTs implemented in the Amazon supply chain.
arXiv Detail & Related papers (2021-12-14T17:53:01Z)
- Automating Control of Overestimation Bias for Continuous Reinforcement Learning [65.63607016094305]
We present a data-driven approach for guiding bias correction.
We demonstrate its effectiveness on Truncated Quantile Critics, a state-of-the-art continuous control algorithm.
arXiv Detail & Related papers (2021-10-26T09:27:12Z)
- Parameter-Free Deterministic Reduction of the Estimation Bias in Continuous Control [0.0]
We introduce a novel, parameter-free deep Q-learning variant that reduces underestimation bias in continuous control.
We test the performance of our improvement on a set of MuJoCo and Box2D continuous control tasks.
arXiv Detail & Related papers (2021-09-24T07:41:07Z)
- Estimation Error Correction in Deep Reinforcement Learning for Deterministic Actor-Critic Methods [0.0]
In value-based deep reinforcement learning methods, approximation of value functions induces overestimation bias and leads to suboptimal policies.
We show that in deep actor-critic methods that aim to overcome the overestimation bias, if the reinforcement signals received by the agent have a high variance, a significant underestimation bias arises.
To minimize the underestimation, we introduce a parameter-free, novel deep Q-learning variant.
arXiv Detail & Related papers (2021-09-22T13:49:35Z)
- Efficient Continuous Control with Double Actors and Regularized Critics [7.072664211491016]
We explore the potential of double actors, which have long been neglected, for better value function estimation in the continuous setting.
We build double actors upon a single critic and upon double critics to handle the overestimation bias in DDPG and the underestimation bias in TD3, respectively.
To mitigate the uncertainty of value estimates from double critics, we propose to regularize the critic networks under the double-actor architecture.
arXiv Detail & Related papers (2021-06-06T07:04:48Z)
- Re-Assessing the "Classify and Count" Quantification Method [88.60021378715636]
"Classify and Count" (CC) is often a biased estimator.
Previous works have failed to use properly optimised versions of CC.
We argue that such optimised versions, while still inferior to some cutting-edge methods, deliver near-state-of-the-art accuracy.
arXiv Detail & Related papers (2020-11-04T21:47:39Z)
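For reference, the plain "Classify and Count" quantifier estimates class prevalence as the fraction of items a classifier assigns to each class. The sketch below illustrates this with a generic classify callable; the interface is hypothetical and not taken from the paper.

```python
from collections import Counter

def classify_and_count(classify, items):
    """Sketch of the plain 'Classify and Count' (CC) quantifier.

    classify: a callable mapping one item to a predicted class label
        (an assumed interface, not an API from the paper).
    items: the unlabeled collection whose class prevalences we estimate.
    """
    counts = Counter(classify(x) for x in items)
    total = len(items)
    # CC estimates prevalence as the share of items assigned to each class;
    # this is biased whenever the classifier's error rates are asymmetric.
    return {label: count / total for label, count in counts.items()}

# Tiny usage example with a hypothetical threshold classifier.
classifier = lambda x: "positive" if x > 0.5 else "negative"
print(classify_and_count(classifier, [0.2, 0.7, 0.9, 0.4, 0.6]))
# {'positive': 0.6, 'negative': 0.4}
```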
- Prediction with Corrupted Expert Advice [67.67399390910381]
We prove that a variant of the classical Multiplicative Weights algorithm with decreasing step sizes achieves constant regret in a benign environment.
Our results reveal a surprising disparity between the often comparable Follow the Regularized Leader (FTRL) and Online Mirror Descent (OMD) frameworks.
arXiv Detail & Related papers (2020-02-24T14:39:55Z)
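To make the result above concrete, here is a minimal sketch of the classical Multiplicative Weights update with a decreasing step size; the eta = 1/sqrt(t+1) schedule and the loss-matrix interface are illustrative assumptions rather than the exact variant analyzed in the paper.

```python
import numpy as np

def multiplicative_weights(loss_matrix):
    """Sketch of Multiplicative Weights with a decreasing step size.

    loss_matrix: array of shape (T, K) with per-round losses in [0, 1]
        for K experts over T rounds (illustrative setup, not the exact
        protocol analyzed in the corrupted-advice paper).
    Returns the learner's expected total loss and the best expert's total loss.
    """
    T, K = loss_matrix.shape
    weights = np.ones(K)
    learner_loss = 0.0
    for t in range(T):
        probs = weights / weights.sum()
        learner_loss += probs @ loss_matrix[t]     # expected loss this round
        eta = 1.0 / np.sqrt(t + 1)                 # decreasing step size
        weights *= np.exp(-eta * loss_matrix[t])   # exponentially down-weight lossy experts
    best_expert_loss = loss_matrix.sum(axis=0).min()
    return learner_loss, best_expert_loss

# Tiny usage example with random losses: 1000 rounds, 5 experts.
rng = np.random.default_rng(0)
print(multiplicative_weights(rng.uniform(size=(1000, 5))))
```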